| Message ID | 20260120083952.15338-1-jacopo.mondi@ideasonboard.com |
|---|---|
| State | New |
Hi Jacopo,

Quoting Jacopo Mondi (2026-01-20 09:39:49)
> Converting numbers with a signed fixed-point representation to
> the corresponding float value requires to include the sign bit in the
> width of the fixed-point integral part.
>
> Clearly specify it in documentation.
>
> Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> ---
>  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> index 6b698fc5d680..b37cdc43936f 100644
> --- a/src/ipa/libipa/fixedpoint.cpp
> +++ b/src/ipa/libipa/fixedpoint.cpp
> @@ -29,11 +29,31 @@ namespace ipa {
>  /**
>   * \fn R fixedToFloatingPoint(T number)
>   * \brief Convert a fixed-point number to a floating point representation
> - * \tparam I Bit width of the integer part of the fixed-point
> + * \tparam I Bit width of the integer part of the fixed-point including the
> + * optional sign bit
>   * \tparam F Bit width of the fractional part of the fixed-point
>   * \tparam R Return type of the floating point representation
>   * \tparam T Input type of the fixed-point representation
>   * \param number The fixed point number to convert to floating point
> + *
> + * If the fixed-point representation is signed, the sign bit shall be included
> + * in the \a I template parameter that specifies the number of bits of the
> + * integral part of the fixed-point representation.
> + *
> + * As an example, a value represented as signed fixed-point Q4.8 format can be
> + * converted to its corresponding floating point representation as:

I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
the 4 is the sign bit? The same way a signed int32 has the sign bit on
the first of the 32 bits?

Best regards,
Stefan

> + *
> + * \code{.cpp}
> + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> + * \endcode
> + *
> + * While a value represented as unsigned fixed-point Q4.8 format can be
> + * converted as:
> + *
> + * \code{.cpp}
> + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> + * \endcode
> + *
>   * \return The converted value
>   */
>
> --
> 2.52.0
>
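To make the width question concrete, here is a minimal standalone sketch of a two's complement fixed-point to floating-point conversion. It is not libcamera's fixedToFloatingPoint(): the helper name signedFixedToDouble is hypothetical, and it assumes the sign bit is counted in I, as the patch documents. It only illustrates why a 13-bit signed register needs I = 5 rather than I = 4.

```cpp
#include <cstdint>

/*
 * Hypothetical helper for illustration only. Interprets the low I + F bits
 * of raw as a two's complement number with F fractional bits, where I is
 * assumed to include the sign bit.
 */
template<unsigned int I, unsigned int F>
constexpr double signedFixedToDouble(uint32_t raw)
{
	static_assert(I + F >= 1 && I + F < 32, "unsupported width");

	constexpr uint32_t width = I + F;
	constexpr uint32_t mask = (uint32_t{ 1 } << width) - 1;
	constexpr uint32_t signBit = uint32_t{ 1 } << (width - 1);

	/* Keep the low `width` bits, then sign-extend from the top one. */
	const uint32_t v = raw & mask;
	const int32_t value = static_cast<int32_t>(v ^ signBit) -
			      static_cast<int32_t>(signBit);

	return static_cast<double>(value) / static_cast<double>(uint32_t{ 1 } << F);
}

/* A 13-bit register holding +10.0 (0x0a00) reads back correctly with I = 5. */
static_assert(signedFixedToDouble<5, 8>(0x0a00) == 10.0, "13-bit decode");
/* With I = 4 only 12 bits are considered and bit 11 is taken as the sign. */
static_assert(signedFixedToDouble<4, 8>(0x0a00) == -6.0, "12-bit decode");
/* A negative 13-bit value: -371 / 256. */
static_assert(signedFixedToDouble<5, 8>(0x1e8d) == -371.0 / 256.0, "negative");
```

The second assertion is the failure mode the documentation change tries to prevent: the same raw bits give a different value when the sign bit is not accounted for in the width.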
Hi Stefan

On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
> Hi Jacopo,
>
> Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > + * converted to its corresponding floating point representation as:
>
> I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> the 4 is the sign bit?

I'm right now looking at the datasheet documentation of a value said
to be in "signed Q4.8" format whose register size is 13 bits

        Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point
Hi Jacopo,

Quoting Jacopo Mondi (2026-01-20 10:00:14)
> Hi Stefan
>
> On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
> > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > the 4 is the sign bit?
>
> I'm right now looking at the datasheet documentation of a value said
> to be in "signed Q4.8" format whose register size is 13 bits
>
>         Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point

I should have consulted Wikipedia first.
https://en.wikipedia.org/wiki/Q_(number_format) clearly states that the
sign bit is implicitly added.

Best regards,
Stefan
On 2026. 01. 20. at 10:00, Jacopo Mondi wrote:
> Hi Stefan
>
> On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
>> I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
>> the 4 is the sign bit?

It would appear there are two interpretations:
https://en.wikipedia.org/wiki/Q_(number_format)

"Texas Instruments version": "Thus, the total number w of bits used is 1 + m + n."
"ARM version": "A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit."

> I'm right now looking at the datasheet documentation of a value said
> to be in "signed Q4.8" format whose register size is 13 bits
>
>         Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point

Does that mean "sign/magnitude" as in
https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ?
If so, then I'm not sure these functions will work.
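The two conventions quoted from the Wikipedia page differ only in whether m counts the sign bit. As a small sketch (the function names widthTI and widthARM are made up for this note), the total storage width of a signed Qm.n value works out as:

```cpp
/* Total number of bits of a signed Qm.n value, "Texas Instruments version":
 * the sign bit is implicit and not counted in m. */
constexpr unsigned int widthTI(unsigned int m, unsigned int n)
{
	return 1 + m + n;
}

/* Same, "ARM version": m already counts the sign bit. */
constexpr unsigned int widthARM(unsigned int m, unsigned int n)
{
	return m + n;
}

/* "Signed Q4.8" is 13 bits wide in one convention and 12 in the other. */
static_assert(widthTI(4, 8) == 13, "TI Q4.8");
static_assert(widthARM(4, 8) == 12, "ARM Q4.8");
/* A 13-bit register is Q4.8 in TI notation and Q5.8 in ARM notation. */
static_assert(widthARM(5, 8) == widthTI(4, 8), "same storage");
```

Under this reading, a datasheet describing a 13-bit register as "signed 4.8" matches the TI convention, while passing I = 5 to the conversion function matches the ARM one.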
Hi Barnabás

On Tue, Jan 20, 2026 at 10:11:10AM +0100, Barnabás Pőcze wrote:
> On 2026. 01. 20. at 10:00, Jacopo Mondi wrote:
> > I'm right now looking at the datasheet documentation of a value said
> > to be in "signed Q4.8" format whose register size is 13 bits
> >
> >         Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point
>
> Does that mean "sign/magnitude" as in https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ?
> If so, then I'm not sure these functions will work.

I had just told Stefan "I'm not sure I actually know what 'magnitude'
implies there", and I didn't :)

So, I had a bit of a read around, including Kieran's Quantized type
series, and I fell into a too familiarly deep rabbit hole.

--------------------- TL;DR -----------------------------------------------
Feel free to skip, these are mostly notes to clarify my understanding
---------------------------------------------------------------------------

Let's look at floatingToFixedPoint() remembering that

        f = float value
        q = value in Q<m,n>

        f = q / 2^n
        q = f * 2^n

And that's what floatingToFixedPoint() does

template<unsigned int I, unsigned int F, typename R, typename T>
constexpr R floatingToFixedPoint(T number)
{
        static_assert(sizeof(int) >= sizeof(R));
        static_assert(I + F <= sizeof(R) * 8);

        R mask = (1 << (F + I)) - 1;
        R frac = static_cast<R>(static_cast<int>(std::round(number * (1 << F)))) & mask;

        return frac;
}

which can be summarized as ((f * 2^n) & mask)

All good, but how is this handled if floatingToFixedPoint<>() is
called as:

        block->gain01 = floatingToFixedPoint<4, 8, uint16_t, double>(1.0);

        uint16_t frac = static_cast<uint16_t>(
                        static_cast<int>(std::round(1.0 * 2^8))) & mask;

        1.0 * 2^8 is a double
        calling std::round(double) picks the right overload and returns a
        double

        the double is cast to int. The C standard doesn't impose a
        representation for signed integers and allows it to be either
        sign/magnitude, 1-complement or 2-complement. It's fair to assume
        2-complement is the standard and the C23 standard makes it so.

        So we have the result of (1.0 * 2^8) represented as a signed int
        (32 bits on the usual 64-bit platforms) in 2-complement.

        According to the C standard, to cast a signed int to an unsigned int

        "When a value with integer type is converted to another integer type
        other than _Bool, if the value can be represented by the new type, it
        is unchanged.

        Otherwise, if the new type is unsigned, the value is converted by
        repeatedly adding or subtracting one more than the maximum value that
        can be represented in the new type until the value is in the range of
        the new type."

        As vague as it might sound to me

        -43 + 2^16 = 65493 = 1111 1111 1101 0101
        which in 2-complement is ... -43

Amazing, let's start from the beginning.

I want to write to a 13 bits register the number -1.45 in signed Q<4.8>
format. The sign bit has to be counted in I, so I = 5 and the mask is
13 bits wide:

        uint16_t q = floatingToFixedPoint<5, 8, uint16_t, double>(-1.45);

        std::round(-1.45 * 2^8) = -371

        static_cast<int>(-371) is stored as 2-complement in an int
        static_cast<uint16_t>(-371) = -371 + 2^16 = 65165

        65165 = 1111 1110 1000 1101

        the mask drops the top three bits, so if we interpret this as a
        register value in Q<4,8> signed format

        xxx1 1110 1000 1101

        1 is the sign bit so let's calculate the 2 complement of
        the remaining bits: ~(1110 1000 1101) + 1 =
                = 0001 0111 0010 + 1 = 370 + 1 = 371

Amazing!

--------------------- End TL;DR -------------------------------------------

Now, I want this in sign/magnitude. I bet there are smarter ways of
doing this, but if I take the rounded value before masking and check
its sign, I can simply add the sign bit back to the absolute value of
the result?

As a bit of pseudo code

        /* Keep the sign: don't mask before taking the absolute value. */
        int reg = static_cast<int>(std::round(number * (1 << F)));
        /* Here mask covers the 12 magnitude bits only. */
        uint16_t res = std::abs(reg) & mask;
        if (reg < 0)
                res |= BIT(12); /* sign bit of the [12:0] register */

I think this could be surely optimized and nicely made a Traits that
can be added to the Quantized series Kieran is working on.
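The -1.45 walk-through above can be checked with a short standalone sketch. Both helpers below are hypothetical (they are not part of libipa), take the register width and the number of fractional bits as run-time arguments for simplicity, and assume a register of at most 16 bits. They encode the same value once as two's complement and once as sign/magnitude.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>

/* Hypothetical helper: two's complement encoding. Round to the nearest
 * step, then keep the low `width` bits of the modular conversion. */
inline uint16_t toTwosComplement(double number, unsigned int fracBits,
				 unsigned int width)
{
	const long scaled = std::lround(number * (1 << fracBits));
	const uint32_t mask = (uint32_t{ 1 } << width) - 1;

	return static_cast<uint16_t>(static_cast<uint32_t>(scaled) & mask);
}

/* Hypothetical helper: sign/magnitude encoding. The magnitude goes in the
 * low (width - 1) bits and the sign in the top bit of the register. */
inline uint16_t toSignMagnitude(double number, unsigned int fracBits,
				unsigned int width)
{
	const long scaled = std::lround(number * (1 << fracBits));
	const uint32_t signBit = uint32_t{ 1 } << (width - 1);
	const uint32_t magnitude =
		static_cast<uint32_t>(std::labs(scaled)) & (signBit - 1);

	return static_cast<uint16_t>(scaled < 0 ? (magnitude | signBit)
						: magnitude);
}
```

For a 13-bit register with 8 fractional bits, toTwosComplement(-1.45, 8, 13) gives 0x1e8d (the bit pattern derived above), while toSignMagnitude(-1.45, 8, 13) gives 0x1173, that is 371 with bit 12 set. Positive values encode identically in both forms.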
On Tue, Jan 20, 2026 at 08:26:29PM +0100, Jacopo Mondi wrote:
> On Tue, Jan 20, 2026 at 10:11:10AM +0100, Barnabás Pőcze wrote:
> > Does that mean "sign/magnitude" as in https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ?
> > If so, then I'm not sure these functions will work.
>
> I think this could be surely optimized and nicely made a Traits that
> can be added to the Quantized series Kieran is working on.

I think you should first test to see if "sign-magnitude" mentioned in
the datasheet actually means that, or if it's a signed fixed-point
value. If it's the former we'll see how to support it.
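One way to frame such a test is to pick a register value that decodes very differently under the two interpretations. The sketch below uses two hypothetical decoders (not libipa functions) for a 13-bit register with 8 fractional bits; how the hardware actually behaves is exactly what would have to be observed.

```cpp
#include <cstdint>

/* Hypothetical decoder: low `width` bits as two's complement. */
constexpr double fromTwosComplement(uint32_t raw, unsigned int width,
				    unsigned int fracBits)
{
	const uint32_t signBit = uint32_t{ 1 } << (width - 1);
	const uint32_t v = raw & ((signBit << 1) - 1);
	const int32_t value = static_cast<int32_t>(v ^ signBit) -
			      static_cast<int32_t>(signBit);

	return static_cast<double>(value) /
	       static_cast<double>(uint32_t{ 1 } << fracBits);
}

/* Hypothetical decoder: top bit is the sign, the rest is the magnitude. */
constexpr double fromSignMagnitude(uint32_t raw, unsigned int width,
				   unsigned int fracBits)
{
	const uint32_t signBit = uint32_t{ 1 } << (width - 1);
	const double magnitude = static_cast<double>(raw & (signBit - 1)) /
				 static_cast<double>(uint32_t{ 1 } << fracBits);

	return (raw & signBit) ? -magnitude : magnitude;
}

/* Positive values are indistinguishable. */
static_assert(fromTwosComplement(0x0100, 13, 8) == 1.0, "positive");
static_assert(fromSignMagnitude(0x0100, 13, 8) == 1.0, "positive");
/* 0x1100 is -1.0 as sign/magnitude but -15.0 as two's complement. */
static_assert(fromSignMagnitude(0x1100, 13, 8) == -1.0, "sign/magnitude");
static_assert(fromTwosComplement(0x1100, 13, 8) == -15.0, "two's complement");
```

Writing 0x1100 to the coefficient and comparing the resulting image against what -1.0 and -15.0 would produce should be enough to tell the two encodings apart, assuming the register accepts the value.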
Hi Jacopo,

Quoting Stefan Klug (2026-01-20 08:53:06)
> Hi Jacopo,
>
> Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > + * If the fixed-point representation is signed, the sign bit shall be included
> > + * in the \a I template parameter that specifies the number of bits of the
> > + * integral part of the fixed-point representation.
> > + *
> > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > + * converted to its corresponding floating point representation as:

Just to be sure - you know I've got patches to remove all of the above
that I want to get merged 'soon' right?

Quantized brings in explicit signed/unsigned types through Q<4,8> and
UQ<4, 8> types.

In the new types Q<I, F> has the sign bit included in 'I'.

I can add that explicitly to the documentation in my new series for v6.

"""
 * The sign of the value is determined by the sign of \a T. For signed types,
 * the number of integer bits includes the sign bit.
"""

--
Kieran
Hi Laurent On Tue, Jan 20, 2026 at 11:23:42PM +0200, Laurent Pinchart wrote: > On Tue, Jan 20, 2026 at 08:26:29PM +0100, Jacopo Mondi wrote: > > On Tue, Jan 20, 2026 at 10:11:10AM +0100, Barnabás Pőcze wrote: > > > 2026. 01. 20. 10:00 keltezéssel, Jacopo Mondi írta: > > > > On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote: > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > > > Converting numbers with a signed fixed-point representation to > > > > > > the corresponding float value requires to include the sign bit in the > > > > > > width of the fixed-point integral part. > > > > > > > > > > > > Clearly specify it in documentation. > > > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > > > --- > > > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > > > /** > > > > > > * \fn R fixedToFloatingPoint(T number) > > > > > > * \brief Convert a fixed-point number to a floating point representation > > > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > > > + * optional sign bit > > > > > > * \tparam F Bit width of the fractional part of the fixed-point > > > > > > * \tparam R Return type of the floating point representation > > > > > > * \tparam T Input type of the fixed-point representation > > > > > > * \param number The fixed point number to convert to floating point > > > > > > + * > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > > > > + * in the \a I template parameter that 
specifies the number of bits of the > > > > > > + * integral part of the fixed-point representation. > > > > > > + * > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > > > > + * converted to its corresponding floating point representation as: > > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > > > > the first of the 32 bits? > > > > > > It would appear there are two interpretations: https://en.wikipedia.org/wiki/Q_(number_format) > > > > > > "Texas Instruments version": "Thus, the total number w of bits used is 1 + m + n." > > > "ARM version": "A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit." > > > > > > > I'm right now looking at the datasheet documentation of a value said > > > > to be in "signed Q4.8" format whose register size is 13 bits > > > > > > > > Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point > > > > > > Does that mean "sign/magnitude" as in https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ? > > > If so, then I'm not sure these functions will work. > > > > I had just told Stefan "I'm not sure I acutally know what 'magnitude' > > implies there", and I didn't :) > > > > So, I had a bit of read around, including Kieran's Quantized type > > series and I fell into a too familiarly deep rabbit hole. 
> > > > --------------------- TL;DR ----------------------------------------------- > > Feel free to skip, these are mostly notes to clarify my understanding > > --------------------------------------------------------------------------- > > > > Let's look at floatingToFixedPoint() remembering that > > > > f = float value > > q = value in Q<m,n> > > > > f = q / 2^n > > q = f * 2^n > > > > And that's what floatingToFixedPoint() does > > > > template<unsigned int I, unsigned int F, typename R, typename T> > > constexpr R floatingToFixedPoint(T number) > > { > > static_assert(sizeof(int) >= sizeof(R)); > > static_assert(I + F <= sizeof(R) * 8); > > > > R mask = (1 << (F + I)) - 1; > > R frac = static_cast<R>(static_cast<int>(std::round(number * (1 << F)))) & mask; > > > > return frac; > > } > > > > wich can be summarized as (n * 2^n & mask) > > > > All good, but how is this handled if floatingToFixedPoint<>() is > > called as: > > block->gain01 = floatingToFixedPoint<4, 8, uint16_t, double>(1.0); > > > > uint16_t frac = static_cast<uint16_t>( > > static_cast<int>(std::round(1.0 * 2^8)) & mask; > > > > 1.0 * 1^8 is a double > > calling std::round(double) picks the right overload and returns a > > double > > > > the double is cast to int. The C standard doesn't impose a > > representation for signed integers and allows it to be either > > sign/magnitude, 1-complement or 2-complement. It's fair to assume > > 2-complement is the standard and the C23 standard makes it so. > > > > So, on a 64 bits platform we have the result of (1.0 * 2^8) > > represented as a signed 64-bit integers in 2-complement. > > > > According to the C standard, to cast a signed int to an unsigned int > > > > "When a value with integer type is converted to another integer type > > other than _Bool, if the value can be represented by the new type, it > > is unchanged. 
> >
> > Otherwise, if the new type is unsigned, the value is converted by
> > repeatedly adding or subtracting one more than the maximum value that
> > can be represented in the new type until the value is in the range of
> > the new type."
> >
> > As vague as it might sound to me
> >
> >   -43 + 2^16 = 65493 = 1111 1111 1101 0101
> >   which in 2-complement is ... -43
> >
> > Amazing, let's start from the beginning.
> >
> > I want to write to a 13-bit register the number -1.45 in signed Q<4,8>
> > format:
> >
> >   uint16_t q = floatingToFixedPoint<4, 8, uint16_t, double>(-1.45);
> >
> >   std::round(-1.45 * 2^8) = -371
> >
> >   static_cast<int>(-371) is stored as 2-complement in an int (32 bits)
> >   static_cast<uint16_t>(-371) = -371 + 2^16 = 65165
> >
> >   65165 = 1111 1110 1000 1101
> >
> >   if we interpret this as a register value in Q<4,8> signed
> >   format
> >
> >   xxx1 1110 1000 1101
> >
> >   1 is the sign bit so let's calculate the 2 complement of
> >   0 1110 1000 1101 = ~(1110 1000 1101) + 1 =
> >   = 0001 0111 0010 + 1 = 370 + 1 = 371
> >
> > Amazing!
> >
> > --------------------- End TL;DR -------------------------------------------
> >
> > Now, I want this in sign/magnitude. I bet there are smarter ways of
> > doing this but if I simply take the rounded value and check its sign,
> > I can add the sign bit back to the absolute value of the result ?
> >
> > As a bit of pseudo code
> >
> >   int reg = static_cast<int>(std::round(number * (1 << F)));
> >   uint16_t res = std::abs(reg) & mask;
> >   if (reg < 0)
> >           res |= BIT(12);
> >
> > I think this could surely be optimized and nicely made a Trait that
> > can be added to the Quantized series Kieran is working on.
>
> I think you should first test to see if "sign-magnitude" mentioned in
Consider that other registers are said to be: - unsigned 4.8-bit fixed-poin or - signed (2's complement) 11-bit integer While these are specifically described as: - sign/magnitude 4.8-bit fixed-point I would tend to believe it actually is correct, but I can check with the vendor maybe > > > > > > > + * > > > > > > + * \code{.cpp} > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > > > + * \endcode > > > > > > + * > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > > > + * converted as: > > > > > > + * > > > > > > + * \code{.cpp} > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > > > + * \endcode > > > > > > + * > > > > > > * \return The converted value > > > > > > */ > > -- > Regards, > > Laurent Pinchart
Hi Kieran On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > Hi Jacopo, > > Quoting Stefan Klug (2026-01-20 08:53:06) > > Hi Jacopo, > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > Converting numbers with a signed fixed-point representation to > > > the corresponding float value requires to include the sign bit in the > > > width of the fixed-point integral part. > > > > > > Clearly specify it in documentation. > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > --- > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > index 6b698fc5d680..b37cdc43936f 100644 > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > @@ -29,11 +29,31 @@ namespace ipa { > > > /** > > > * \fn R fixedToFloatingPoint(T number) > > > * \brief Convert a fixed-point number to a floating point representation > > > - * \tparam I Bit width of the integer part of the fixed-point > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > + * optional sign bit > > > * \tparam F Bit width of the fractional part of the fixed-point > > > * \tparam R Return type of the floating point representation > > > * \tparam T Input type of the fixed-point representation > > > * \param number The fixed point number to convert to floating point > > > + * > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > + * in the \a I template parameter that specifies the number of bits of the > > > + * integral part of the fixed-point representation. 
> > > + * > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > + * converted to its corresponding floating point representation as: > > Just to be sure - you know I've got patches to remove all of the above > that I want to get merged 'soon' right? Read the last bit of my reply from yesterday :) > > Quantized brings in explicit signed/unsigned types through Q<4,8> and > UQ<4, 8> types. What is the difference between signed and unsigned ? Is it only the sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] > > In the new types Q<I, F> has the sign bit included in 'I'. > I can add that explicitly to the documentation in my new series for v6. Well, maybe we need two traits ? https://en.wikipedia.org/wiki/Q_(number_format) Texas Instruments version: The first bit always gives the sign of the value (1 = negative, 0 = non-negative), and it is not counted in the m parameter. Thus, the total number w of bits used is 1 + m + n. ARM Version: A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit I guess the only way to know which one is meant to be used is to actually look at the register sizes. If a Q<4,8> number is stored as a 13 bit fields, then the TI version is used. I wonder how common the ARM version is. > > > """ > * The sign of the value is determined by the sign of \a T. For signed types, > * the number of integer bits includes the sign bit. > """ > > -- > Kieran > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > the first of the 32 bits? 
> > > > Best regards, > > Stefan > > > > > + * > > > + * \code{.cpp} > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > + * \endcode > > > + * > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > + * converted as: > > > + * > > > + * \code{.cpp} > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > + * \endcode > > > + * > > > * \return The converted value > > > */ > > > > > > -- > > > 2.52.0 > > >
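The register-size check described in this message can be written down as a small sketch. The names QConvention, signedQWidth(), registerMatches() and libipaIntegerBits() are made up for illustration; the only facts taken from the thread are that the TI notation uses 1 + m + n bits, that the ARM notation counts the sign bit in m, and that the existing libipa helpers expect the sign bit to be included in I.

```cpp
#include <cassert>

/* Hypothetical names, for illustration only. */
enum class QConvention { TI, ARM };

/* Total storage width of a signed Qm.n value under each convention. */
constexpr unsigned int signedQWidth(QConvention convention,
				    unsigned int m, unsigned int n)
{
	return convention == QConvention::TI ? 1 + m + n : m + n;
}

/* True if a register of registerBits bits fits Qm.n under the convention. */
constexpr bool registerMatches(QConvention convention, unsigned int m,
			       unsigned int n, unsigned int registerBits)
{
	return signedQWidth(convention, m, n) == registerBits;
}

/*
 * Value to pass as the I template parameter of the existing helpers,
 * which expect the sign bit to be counted in I.
 */
constexpr unsigned int libipaIntegerBits(QConvention convention, unsigned int m)
{
	return convention == QConvention::TI ? m + 1 : m;
}

static_assert(signedQWidth(QConvention::TI, 4, 8) == 13, "TI Q4.8 is 13 bits");
static_assert(signedQWidth(QConvention::ARM, 4, 8) == 12, "ARM Q4.8 is 12 bits");
static_assert(libipaIntegerBits(QConvention::TI, 4) == 5, "TI Q4.8 maps to I = 5");
```

So a datasheet "signed Q4.8" field that is 13 bits wide matches the TI reading, and corresponds to the <5, 8> template arguments used in the patch.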
Quoting Jacopo Mondi (2026-01-21 12:53:49) > Hi Kieran > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > > Hi Jacopo, > > > > Quoting Stefan Klug (2026-01-20 08:53:06) > > > Hi Jacopo, > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > Converting numbers with a signed fixed-point representation to > > > > the corresponding float value requires to include the sign bit in the > > > > width of the fixed-point integral part. > > > > > > > > Clearly specify it in documentation. > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > --- > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > /** > > > > * \fn R fixedToFloatingPoint(T number) > > > > * \brief Convert a fixed-point number to a floating point representation > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > + * optional sign bit > > > > * \tparam F Bit width of the fractional part of the fixed-point > > > > * \tparam R Return type of the floating point representation > > > > * \tparam T Input type of the fixed-point representation > > > > * \param number The fixed point number to convert to floating point > > > > + * > > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > > + * in the \a I template parameter that specifies the number of bits of the > > > > + * integral part of the fixed-point representation. 
> > > > + * > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > > + * converted to its corresponding floating point representation as: > > > > Just to be sure - you know I've got patches to remove all of the above > > that I want to get merged 'soon' right? > > Read the last bit of my reply from yesterday :) > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and > > UQ<4, 8> types. > > What is the difference between signed and unsigned ? Is it only the > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] Please take a look through the tests I've added: https://patchwork.libcamera.org/patch/25801/ /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/ /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */ /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */ /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */ It's easy to extend that if you have specific Q types you want to use/test. > > > > In the new types Q<I, F> has the sign bit included in 'I'. > > I can add that explicitly to the documentation in my new series for v6. > > > Well, maybe we need two traits ? > https://en.wikipedia.org/wiki/Q_(number_format) > > Texas Instruments version: > The first bit always gives the sign of the value (1 = negative, 0 = > non-negative), and it is not counted in the m parameter. Thus, the > total number w of bits used is 1 + m + n. > > ARM Version: > A variant of the Q notation has been in use by ARM in which the m > number also counts the sign bit Yes, you've definitely got to know which one the hardware is using and expecting. I wouldn't make a new trait for this - if we have to specify we can wrap one in the other if it really helps. -- Kieran > > I guess the only way to know which one is meant to be used is to > actually look at the register sizes. 
If a Q<4,8> number is stored as > a 13 bit fields, then the TI version is used. I wonder how common the > ARM version is. > > > > > > > """ > > * The sign of the value is determined by the sign of \a T. For signed types, > > * the number of integer bits includes the sign bit. > > """ > > > > -- > > Kieran > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > > the first of the 32 bits? > > > > > > Best regards, > > > Stefan > > > > > > > + * > > > > + * \code{.cpp} > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > + * \endcode > > > > + * > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > + * converted as: > > > > + * > > > > + * \code{.cpp} > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > + * \endcode > > > > + * > > > > * \return The converted value > > > > */ > > > > > > > > -- > > > > 2.52.0 > > > >
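The ranges quoted from the tests follow directly from the bit widths, as this sketch shows. It assumes the convention stated above (for signed types the sign bit is counted in I) and two's complement storage. QRange, signedQRange() and unsignedQRange() are hypothetical names, not the API of the Quantized series.

```cpp
#include <cassert>

/* Hypothetical helper type, not the API of the Quantized series. */
struct QRange {
	double min;
	double max;
	double step;
};

/* Signed Q<I, F>, sign bit counted in I, two's complement storage. */
constexpr QRange signedQRange(unsigned int I, unsigned int F)
{
	return QRange{
		-static_cast<double>(1ull << (I + F - 1)) / static_cast<double>(1ull << F),
		static_cast<double>((1ull << (I + F - 1)) - 1) / static_cast<double>(1ull << F),
		1.0 / static_cast<double>(1ull << F),
	};
}

/* Unsigned UQ<I, F>: all I + F bits carry magnitude. */
constexpr QRange unsignedQRange(unsigned int I, unsigned int F)
{
	return QRange{
		0.0,
		static_cast<double>((1ull << (I + F)) - 1) / static_cast<double>(1ull << F),
		1.0 / static_cast<double>(1ull << F),
	};
}

static_assert(signedQRange(1, 7).min == -1.0, "Q1.7 minimum");
static_assert(signedQRange(1, 7).max == 0.9921875, "Q1.7 maximum");
static_assert(unsignedQRange(1, 7).max == 1.9921875, "UQ1.7 maximum");
static_assert(signedQRange(12, 4).min == -2048.0, "Q12.4 minimum");
static_assert(unsignedQRange(12, 4).max == 4095.9375, "UQ12.4 maximum");
```

Both variants cover the same number of steps with the same step size; the signed one is shifted so that half of the codes are negative.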
Hi Kieran On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote: > Quoting Jacopo Mondi (2026-01-21 12:53:49) > > Hi Kieran > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > > > Hi Jacopo, > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06) > > > > Hi Jacopo, > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > > Converting numbers with a signed fixed-point representation to > > > > > the corresponding float value requires to include the sign bit in the > > > > > width of the fixed-point integral part. > > > > > > > > > > Clearly specify it in documentation. > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > > --- > > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > > /** > > > > > * \fn R fixedToFloatingPoint(T number) > > > > > * \brief Convert a fixed-point number to a floating point representation > > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > > + * optional sign bit > > > > > * \tparam F Bit width of the fractional part of the fixed-point > > > > > * \tparam R Return type of the floating point representation > > > > > * \tparam T Input type of the fixed-point representation > > > > > * \param number The fixed point number to convert to floating point > > > > > + * > > > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > > > + * in the \a I template parameter that specifies the number of bits of the > > > > > + * integral part of the fixed-point representation. 
> > > > > + * > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > > > + * converted to its corresponding floating point representation as: > > > > > > Just to be sure - you know I've got patches to remove all of the above > > > that I want to get merged 'soon' right? > > > > Read the last bit of my reply from yesterday :) > > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and > > > UQ<4, 8> types. > > > > What is the difference between signed and unsigned ? Is it only the > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] > > Please take a look through the tests I've added: > > https://patchwork.libcamera.org/patch/25801/ > > /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/ > /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */ > > /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */ > /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */ > > It's easy to extend that if you have specific Q types you want to > use/test. Ah yes, for min/max it's defintely useful to have signed/unsigned types > > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'. > > > I can add that explicitly to the documentation in my new series for v6. > > > > > > Well, maybe we need two traits ? > > https://en.wikipedia.org/wiki/Q_(number_format) > > > > Texas Instruments version: > > The first bit always gives the sign of the value (1 = negative, 0 = > > non-negative), and it is not counted in the m parameter. Thus, the > > total number w of bits used is 1 + m + n. > > > > ARM Version: > > A variant of the Q notation has been in use by ARM in which the m > > number also counts the sign bit > > Yes, you've definitely got to know which one the hardware is using and > expecting. 
I wouldn't make a new trait for this - if we have to specify > we can wrap one in the other if it really helps. I'm not sure, if I'm working with the TI format (which as far as I understand is the most common?) then to have a signed value correctly represented as a Q<4,8> I would have to use Q<5,8> (which is counter-intuitive). I would rather modify the Trait to put the sign in the [m + n + 1] bit. Or are the registers you're working with in ARM format ? (sign in [m + n] position) Thanks j > > -- > Kieran > > > > > > I guess the only way to know which one is meant to be used is to > > actually look at the register sizes. If a Q<4,8> number is stored as > > a 13 bit fields, then the TI version is used. I wonder how common the > > ARM version is. > > > > > > > > > > > """ > > > * The sign of the value is determined by the sign of \a T. For signed types, > > > * the number of integer bits includes the sign bit. > > > """ > > > > > > -- > > > Kieran > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > > > the first of the 32 bits? > > > > > > > > Best regards, > > > > Stefan > > > > > > > > > + * > > > > > + * \code{.cpp} > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > > + * \endcode > > > > > + * > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > > + * converted as: > > > > > + * > > > > > + * \code{.cpp} > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > > + * \endcode > > > > > + * > > > > > * \return The converted value > > > > > */ > > > > > > > > > > -- > > > > > 2.52.0 > > > > >
Quoting Jacopo Mondi (2026-01-21 15:12:24) > Hi Kieran > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote: > > Quoting Jacopo Mondi (2026-01-21 12:53:49) > > > Hi Kieran > > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > > > > Hi Jacopo, > > > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06) > > > > > Hi Jacopo, > > > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > > > Converting numbers with a signed fixed-point representation to > > > > > > the corresponding float value requires to include the sign bit in the > > > > > > width of the fixed-point integral part. > > > > > > > > > > > > Clearly specify it in documentation. > > > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > > > --- > > > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > > > /** > > > > > > * \fn R fixedToFloatingPoint(T number) > > > > > > * \brief Convert a fixed-point number to a floating point representation > > > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > > > + * optional sign bit > > > > > > * \tparam F Bit width of the fractional part of the fixed-point > > > > > > * \tparam R Return type of the floating point representation > > > > > > * \tparam T Input type of the fixed-point representation > > > > > > * \param number The fixed point number to convert to floating point > > > > > > + * > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > > > > + * in the \a 
I template parameter that specifies the number of bits of the > > > > > > + * integral part of the fixed-point representation. > > > > > > + * > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > > > > + * converted to its corresponding floating point representation as: > > > > > > > > Just to be sure - you know I've got patches to remove all of the above > > > > that I want to get merged 'soon' right? > > > > > > Read the last bit of my reply from yesterday :) I still don't get this? > > > > > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and > > > > UQ<4, 8> types. > > > > > > What is the difference between signed and unsigned ? Is it only the > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] > > > > Please take a look through the tests I've added: > > > > https://patchwork.libcamera.org/patch/25801/ > > > > /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/ > > /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */ > > > > /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */ > > /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */ > > > > It's easy to extend that if you have specific Q types you want to > > use/test. > > Ah yes, for min/max it's defintely useful to have signed/unsigned > types It's not about min/max is useful - it's the very fact that Q and UQ have a distinct range. Q types can go less than zero but still span the same distance, so the top/max is halved, but the step size is the same. > > > > In the new types Q<I, F> has the sign bit included in 'I'. > > > > I can add that explicitly to the documentation in my new series for v6. > > > > > > > > > Well, maybe we need two traits ? 
> > > https://en.wikipedia.org/wiki/Q_(number_format) > > > > > > Texas Instruments version: > > > The first bit always gives the sign of the value (1 = negative, 0 = > > > non-negative), and it is not counted in the m parameter. Thus, the > > > total number w of bits used is 1 + m + n. > > > > > > ARM Version: > > > A variant of the Q notation has been in use by ARM in which the m > > > number also counts the sign bit > > > > Yes, you've definitely got to know which one the hardware is using and > > expecting. I wouldn't make a new trait for this - if we have to specify > > we can wrap one in the other if it really helps. > > I'm not sure, if I'm working with the TI format (which as far as I > understand is the most common?) then to have a signed value correctly > represented as a Q<4,8> I would have to use Q<5,8> (which is > counter-intuitive). > > I would rather modify the Trait to put the sign in the [m + n + 1] > bit. > > Or are the registers you're working with in ARM format ? (sign in > [m + n] position) That's (include the bit) what the original fixedToFloatingPoint() implementations used, so that's what I've continued with. If you want to distinguish these? How should we represent them? /* All 8 bit storage */ UQ<1, 7> Q<1, 7> Q_TI<0, 7> ? -- Kieran > > Thanks > j > > > > > -- > > Kieran > > > > > > > > > > I guess the only way to know which one is meant to be used is to > > > actually look at the register sizes. If a Q<4,8> number is stored as > > > a 13 bit fields, then the TI version is used. I wonder how common the > > > ARM version is. > > > > > > > > > > > > > > > """ > > > > * The sign of the value is determined by the sign of \a T. For signed types, > > > > * the number of integer bits includes the sign bit. > > > > """ > > > > > > > > -- > > > > Kieran > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > > > the 4 is the sign bit? 
The same way a signed int32 has the signed bit on > > > > > the first of the 32 bits? > > > > > > > > > > Best regards, > > > > > Stefan > > > > > > > > > > > + * > > > > > > + * \code{.cpp} > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > > > + * \endcode > > > > > > + * > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > > > + * converted as: > > > > > > + * > > > > > > + * \code{.cpp} > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > > > + * \endcode > > > > > > + * > > > > > > * \return The converted value > > > > > > */ > > > > > > > > > > > > -- > > > > > > 2.52.0 > > > > > >
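One possible reading of "wrap one in the other" is an alias that shifts the integer width by one. The sketch below uses a stand-in Q template only to show the idea: it is not the real type from the Quantized series, and Q_TI is merely the name floated in this message.

```cpp
#include <cassert>
#include <type_traits>

/*
 * Stand-in for a signed fixed-point type whose sign bit is counted in I.
 * Not the real Quantized type, only a placeholder for illustration.
 */
template<unsigned int I, unsigned int F>
struct Q {
	static constexpr unsigned int integerBits = I;
	static constexpr unsigned int fractionalBits = F;
	static constexpr unsigned int totalBits = I + F;
};

/* TI-style spelling: the sign bit is not counted in M, so add it back. */
template<unsigned int M, unsigned int N>
using Q_TI = Q<M + 1, N>;

static_assert(std::is_same<Q_TI<0, 7>, Q<1, 7>>::value,
	      "Q_TI<0, 7> and Q<1, 7> describe the same 8-bit storage");
static_assert(Q_TI<4, 8>::totalBits == 13,
	      "a TI-style Q4.8 value occupies 13 bits");
```

With such an alias a register documented as TI-style Q4.8 could be spelled Q_TI<4, 8> in the code while sharing the implementation of Q<5, 8>.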
On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote: > Quoting Jacopo Mondi (2026-01-21 15:12:24) > > Hi Kieran > > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote: > > > Quoting Jacopo Mondi (2026-01-21 12:53:49) > > > > Hi Kieran > > > > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > > > > > Hi Jacopo, > > > > > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06) > > > > > > Hi Jacopo, > > > > > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > > > > Converting numbers with a signed fixed-point representation to > > > > > > > the corresponding float value requires to include the sign bit in the > > > > > > > width of the fixed-point integral part. > > > > > > > > > > > > > > Clearly specify it in documentation. > > > > > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > > > > --- > > > > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > > > > /** > > > > > > > * \fn R fixedToFloatingPoint(T number) > > > > > > > * \brief Convert a fixed-point number to a floating point representation > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > > > > + * optional sign bit > > > > > > > * \tparam F Bit width of the fractional part of the fixed-point > > > > > > > * \tparam R Return type of the floating point representation > > > > > > > * \tparam T Input type of the fixed-point representation > > > > > > > * \param number The fixed point number to convert to 
floating point
> > > > > > > + *
> > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > > + * integral part of the fixed-point representation.
> > > > > > > + *
> > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > > + * converted to its corresponding floating point representation as:
> > > > >
> > > > > Just to be sure - you know I've got patches to remove all of the above
> > > > > that I want to get merged 'soon' right?
> > > >
> > > > Read the last bit of my reply from yesterday :)
>
> I still don't get this?
>

I meant the discussion on sign/magnitude representation

sign/magnitude is a different representation of signed integers
compared to the de-facto standard 2's complement. It requires
manipulating the result of the float-to-fixed conversion so that we take
the absolute value and the sign bit is set in the [m + n + 1] bit

------------------------------------------------------------------------------
As a bit of pseudo code

  int reg = static_cast<int>(std::round(number * (1 << F)));
  uint16_t res = std::abs(reg) & mask;
  if (reg < 0)
          res |= BIT(12);

I think this could surely be optimized and nicely made a Trait that
can be added to the Quantized series Kieran is working on.
------------------------------------------------------------------------------

The above is the phrase I thought could make you happy:
sign/magnitude fixed-point formats can easily be represented with a
Trait on top of your series

>
> > > >
> > > > >
> > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > > UQ<4, 8> types.
> > > >
> > > > What is the difference between signed and unsigned ? Is it only the
> > > > sign bit ?
I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] > > > > > > Please take a look through the tests I've added: > > > > > > https://patchwork.libcamera.org/patch/25801/ > > > > > > /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/ > > > /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */ > > > > > > /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */ > > > /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */ > > > > > > It's easy to extend that if you have specific Q types you want to > > > use/test. > > > > Ah yes, for min/max it's defintely useful to have signed/unsigned > > types > > It's not about min/max is useful - it's the very fact that Q and UQ have > a distinct range. Q types can go less than zero but still span the same > distance, so the top/max is halved, but the step size is the same. Yes, min/max and range indeed. > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'. > > > > > I can add that explicitly to the documentation in my new series for v6. > > > > > > > > > > > > Well, maybe we need two traits ? > > > > https://en.wikipedia.org/wiki/Q_(number_format) > > > > > > > > Texas Instruments version: > > > > The first bit always gives the sign of the value (1 = negative, 0 = > > > > non-negative), and it is not counted in the m parameter. Thus, the > > > > total number w of bits used is 1 + m + n. > > > > > > > > ARM Version: > > > > A variant of the Q notation has been in use by ARM in which the m > > > > number also counts the sign bit > > > > > > Yes, you've definitely got to know which one the hardware is using and > > > expecting. I wouldn't make a new trait for this - if we have to specify > > > we can wrap one in the other if it really helps. > > > > I'm not sure, if I'm working with the TI format (which as far as I > > understand is the most common?) 
then to have a signed value correctly
> > represented as a Q<4,8> I would have to use Q<5,8> (which is
> > counter-intuitive).
> >
> > I would rather modify the Trait to put the sign in the [m + n + 1]
> > bit.
> >
> > Or are the registers you're working with in ARM format ? (sign in
> > [m + n] position)
>
>
> That's (include the bit) what the original fixedToFloatingPoint()
> implementations used, so that's what I've continued with.

I see but that doesn't mean it's correct.

In one platform manual I read the description of a coefficient as

"8:0 cc_coeff_0
Coefficient 0 for color space conversion"

color conversion coefficients are signed integer values with a 7 bit
fractional part; range: [-2…1.992]

so if there are 7 fractional bits and the max achievable value is 1.992
it means that the value is in Q<1,7> format as:

        ((1 << (1 + 7)) - 1) / (1 << 7) = 255 / 128 = 1.992

the register size is 9 bits (see the [8:0] in the register
description) so the sign bit is at location [8].

Am I wrong that, to obtain this with your model, I would have
to describe the fixed point representation as Q<2,7> (which doesn't
match the datasheet) ?

And I guess this really is the difference between UQ<m, n> and Q<m, n>

unsigned Q has no sign bit and the destination register is of size [m+n]

signed Q has a sign bit in position [m+n+1] with the value in 2's
complement format and destination register of size [m+n+1]

>
> If you want to distinguish these? How should we represent them?
>
>
> /* All 8 bit storage */
> UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?
>

Let's start by deciding what behaviour we want by default maybe..

> --
> Kieran
>
> >
> > Thanks
> > j
> >
> > >
> > > --
> > > Kieran
> > >
> > >
> > > >
> > > > I guess the only way to know which one is meant to be used is to
> > > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > > a 13 bit field, then the TI version is used. I wonder how common the
> > > > ARM version is.
> > > > > > > > > > > > > > > > > > > """ > > > > > * The sign of the value is determined by the sign of \a T. For signed types, > > > > > * the number of integer bits includes the sign bit. > > > > > """ > > > > > > > > > > -- > > > > > Kieran > > > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > > > > > the first of the 32 bits? > > > > > > > > > > > > Best regards, > > > > > > Stefan > > > > > > > > > > > > > + * > > > > > > > + * \code{.cpp} > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > > > > + * \endcode > > > > > > > + * > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > > > > + * converted as: > > > > > > > + * > > > > > > > + * \code{.cpp} > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > > > > + * \endcode > > > > > > > + * > > > > > > > * \return The converted value > > > > > > > */ > > > > > > > > > > > > > > -- > > > > > > > 2.52.0 > > > > > > >
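The cc_coeff_0 example above can be checked with a short sketch: a 9-bit two's complement field with 7 fractional bits spans exactly [-2, 1.9921875]. The helper name twosComplementToFloat() is hypothetical. It takes the full register width, so the Q<1,7> versus Q<2,7> naming question does not affect the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

/*
 * Hypothetical helper: decode a Bits-wide two's complement register field
 * with F fractional bits. Bits is the full field width, sign bit included.
 */
template<unsigned int Bits, unsigned int F>
constexpr double twosComplementToFloat(uint32_t reg)
{
	static_assert(Bits >= 1 && Bits < 32, "unsupported register width");
	static_assert(F < Bits, "too many fractional bits");

	constexpr uint32_t signBit = 1u << (Bits - 1);
	constexpr uint32_t mask = (signBit << 1) - 1;

	/* Sign-extend: flip the sign bit, then subtract its weight. */
	return static_cast<double>(static_cast<int32_t>((reg & mask) ^ signBit) -
				   static_cast<int32_t>(signBit)) /
	       static_cast<double>(1u << F);
}

/* 9-bit field [8:0] with 7 fractional bits, as in the datasheet excerpt. */
static_assert(twosComplementToFloat<9, 7>(0x100) == -2.0, "minimum");
static_assert(twosComplementToFloat<9, 7>(0x0ff) == 1.9921875, "maximum");
static_assert(twosComplementToFloat<9, 7>(0x080) == 1.0, "unity");
```

The same field is Q1.7 in the TI notation (1 + 1 + 7 = 9 bits) and Q2.7 in the ARM notation (2 + 7 = 9 bits), which is the mismatch described in the message above.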
On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote: > On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote: > > Quoting Jacopo Mondi (2026-01-21 15:12:24) > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote: > > > > Quoting Jacopo Mondi (2026-01-21 12:53:49) > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06) > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > > > > > Converting numbers with a signed fixed-point representation to > > > > > > > > the corresponding float value requires to include the sign bit in the > > > > > > > > width of the fixed-point integral part. > > > > > > > > > > > > > > > > Clearly specify it in documentation. > > > > > > > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > > > > > --- > > > > > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > > > > > /** > > > > > > > > * \fn R fixedToFloatingPoint(T number) > > > > > > > > * \brief Convert a fixed-point number to a floating point representation > > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > > > > > + * optional sign bit > > > > > > > > * \tparam F Bit width of the fractional part of the fixed-point > > > > > > > > * \tparam R Return type of the floating point representation > > > > > > > > * \tparam T Input type of the fixed-point representation > > > > > > > > * \param number The fixed point number 
to convert to floating point > > > > > > > > + * > > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > > > > > > + * in the \a I template parameter that specifies the number of bits of the > > > > > > > > + * integral part of the fixed-point representation. > > > > > > > > + * > > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > > > > > > + * converted to its corresponding floating point representation as: > > > > > > > > > > > > Just to be sure - you know I've got patches to remove all of the above > > > > > > that I want to get merged 'soon' right? > > > > > > > > > > Read the last bit of my reply from yesterday :) > > > > I still don't get this? > > I meant the discussion on sign/magnitude representation > > sign/magnitude is a different representation of signed integers > compared to the de-facto standard 2's complement. It requires to > manipulate the result of the float-to-fixed conversion so that we take > the absolute value and the sign bit is set in the [m + n + 1] bit > > ------------------------------------------------------------------------------ > As a bit of pseudo code > > int reg = static_cast<int>(std::round(number * (1 << F)))) & mask; > uint16_t res += std::abs(reg); > if (reg < 0) > res |= BIT(13); > > > I think this could be surely optimized and nicely made a Traits that > can be added to the Quantized series Kieran is working on. > ------------------------------------------------------------------------------ > > The above is the pharse I thought it could make you happy: > sign/magnitude fixed-point formats can be easily be represented with a > Trait on top of your series > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and > > > > > > UQ<4, 8> types. > > > > > > > > > > What is the difference between signed and unsigned ? Is it only the > > > > > sign bit ? 
I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] > > > > > > > > Please take a look through the tests I've added: > > > > > > > > https://patchwork.libcamera.org/patch/25801/ > > > > > > > > /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/ > > > > /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */ > > > > > > > > /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */ > > > > /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */ > > > > > > > > It's easy to extend that if you have specific Q types you want to > > > > use/test. > > > > > > Ah yes, for min/max it's defintely useful to have signed/unsigned > > > types > > > > It's not about min/max is useful - it's the very fact that Q and UQ have > > a distinct range. Q types can go less than zero but still span the same > > distance, so the top/max is halved, but the step size is the same. > > Yes, min/max and range indeed. > > > > > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'. > > > > > > I can add that explicitly to the documentation in my new series for v6. > > > > > > > > > > > > > > > Well, maybe we need two traits ? > > > > > https://en.wikipedia.org/wiki/Q_(number_format) > > > > > > > > > > Texas Instruments version: > > > > > The first bit always gives the sign of the value (1 = negative, 0 = > > > > > non-negative), and it is not counted in the m parameter. Thus, the > > > > > total number w of bits used is 1 + m + n. > > > > > > > > > > ARM Version: > > > > > A variant of the Q notation has been in use by ARM in which the m > > > > > number also counts the sign bit > > > > > > > > Yes, you've definitely got to know which one the hardware is using and > > > > expecting. I wouldn't make a new trait for this - if we have to specify > > > > we can wrap one in the other if it really helps. 
> > > > > > I'm not sure, if I'm working with the TI format (which as far as I > > > understand is the most common?) then to have a signed value correctly > > > represented as a Q<4,8> I would have to use Q<5,8> (which is > > > counter-intuitive). > > > > > > I would rather modify the Trait to put the sign in the [m + n + 1] > > > bit. > > > > > > Or are the registers you're working with in ARM format ? (sign in > > > [m + n] position) > > > > That's (include the bit) what the original fixedToFloatingPoint() > > implementations used, so that's what I've continued with. > > I see but that doesn't mean it's correct. > > I read one platform manual the description of a coefficient as > > "8:0 cc_coeff_0 Coefficient 0 for color space conversion" > color conversion coefficients are signed integer values with a 7 bit > fractional part; range: [-2…1.992] > > so if there are 7 fractional bit and the max achievable value is 1.992 > it means that the value is in Q<1,7> format as: > > (1 << (1 + 7)) - 1 / (1 << 7) = 1.999 > > the register size is 9 bits (see the [8:0] in the register > description) so I the sign bit is at location [8]. > > Am I wrong that I want to obtain this with your model I would have to > describe the fixed point representation as Q<2,7> (which doesn't match > the datasheet) ? Why doesn't this match the datasheet ? The text you quoted says 7 bits of fractional value (match), 9 bits register field (8:0, matching 2+7), and the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise). > And I guess this really is the difference between UQ<m, n> and Q<m, n> > > usigned Q has no sign bit and the destination register is of size [m+n] > signed Q has a sign bit in position [m+n+1] with the value in 2's > complement format and destination register of size [m+n+1] In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1. > > If you want to distinguish these? How should we represent them? > > > > /* All 8 bit storage */ > > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ? 
> > Let's start by deciding what behaviour we want by default maybe.. Let's pick one option and stick to it please. Yes, writing Q<4, 12> when a TI datasheet says "Q3.12 value" may be a bit confusing, but it's encoded in the type in one place and the rest of the code doesn't have to think about it. We *could* define device-specific aliases in specific IPA modules if we really wanted, but I wouldn't define multiple types in libipa. > > > > > I guess the only way to know which one is meant to be used is to > > > > > actually look at the register sizes. If a Q<4,8> number is stored as > > > > > a 13 bit fields, then the TI version is used. I wonder how common the > > > > > ARM version is. > > > > > > > > > > > """ > > > > > > * The sign of the value is determined by the sign of \a T. For signed types, > > > > > > * the number of integer bits includes the sign bit. > > > > > > """ > > > > > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > > > > > > the first of the 32 bits? > > > > > > > > > > > > > > > + * > > > > > > > > + * \code{.cpp} > > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > > > > > + * \endcode > > > > > > > > + * > > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > > > > > + * converted as: > > > > > > > > + * > > > > > > > > + * \code{.cpp} > > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > > > > > + * \endcode > > > > > > > > + * > > > > > > > > * \return The converted value > > > > > > > > */ > > > > > > > >
Hi Laurent On Wed, Jan 21, 2026 at 06:37:55PM +0200, Laurent Pinchart wrote: > On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote: > > On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote: > > > Quoting Jacopo Mondi (2026-01-21 15:12:24) > > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote: > > > > > Quoting Jacopo Mondi (2026-01-21 12:53:49) > > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06) > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > > > > > > Converting numbers with a signed fixed-point representation to > > > > > > > > > the corresponding float value requires to include the sign bit in the > > > > > > > > > width of the fixed-point integral part. > > > > > > > > > > > > > > > > > > Clearly specify it in documentation. > > > > > > > > > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > > > > > > --- > > > > > > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > > > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > > > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > > > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > > > > > > /** > > > > > > > > > * \fn R fixedToFloatingPoint(T number) > > > > > > > > > * \brief Convert a fixed-point number to a floating point representation > > > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > > > > > > + * optional sign bit > > > > > > > > > * \tparam F Bit width of the fractional part of the fixed-point > > > > > > > > > * \tparam R Return type of the floating point 
representation > > > > > > > > > * \tparam T Input type of the fixed-point representation > > > > > > > > > * \param number The fixed point number to convert to floating point > > > > > > > > > + * > > > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > > > > > > > + * in the \a I template parameter that specifies the number of bits of the > > > > > > > > > + * integral part of the fixed-point representation. > > > > > > > > > + * > > > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > > > > > > > + * converted to its corresponding floating point representation as: > > > > > > > > > > > > > > Just to be sure - you know I've got patches to remove all of the above > > > > > > > that I want to get merged 'soon' right? > > > > > > > > > > > > Read the last bit of my reply from yesterday :) > > > > > > I still don't get this? > > > > I meant the discussion on sign/magnitude representation > > > > sign/magnitude is a different representation of signed integers > > compared to the de-facto standard 2's complement. It requires to > > manipulate the result of the float-to-fixed conversion so that we take > > the absolute value and the sign bit is set in the [m + n + 1] bit > > > > ------------------------------------------------------------------------------ > > As a bit of pseudo code > > > > int reg = static_cast<int>(std::round(number * (1 << F)))) & mask; > > uint16_t res += std::abs(reg); > > if (reg < 0) > > res |= BIT(13); > > > > > > I think this could be surely optimized and nicely made a Traits that > > can be added to the Quantized series Kieran is working on. 
> > ------------------------------------------------------------------------------ > > > > The above is the pharse I thought it could make you happy: > > sign/magnitude fixed-point formats can be easily be represented with a > > Trait on top of your series > > > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and > > > > > > > UQ<4, 8> types. > > > > > > > > > > > > What is the difference between signed and unsigned ? Is it only the > > > > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] > > > > > > > > > > Please take a look through the tests I've added: > > > > > > > > > > https://patchwork.libcamera.org/patch/25801/ > > > > > > > > > > /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/ > > > > > /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */ > > > > > > > > > > /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */ > > > > > /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */ > > > > > > > > > > It's easy to extend that if you have specific Q types you want to > > > > > use/test. > > > > > > > > Ah yes, for min/max it's defintely useful to have signed/unsigned > > > > types > > > > > > It's not about min/max is useful - it's the very fact that Q and UQ have > > > a distinct range. Q types can go less than zero but still span the same > > > distance, so the top/max is halved, but the step size is the same. > > > > Yes, min/max and range indeed. > > > > > > > > > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'. > > > > > > > I can add that explicitly to the documentation in my new series for v6. > > > > > > > > > > > > > > > > > > Well, maybe we need two traits ? 
> > > > > > https://en.wikipedia.org/wiki/Q_(number_format) > > > > > > > > > > > > Texas Instruments version: > > > > > > The first bit always gives the sign of the value (1 = negative, 0 = > > > > > > non-negative), and it is not counted in the m parameter. Thus, the > > > > > > total number w of bits used is 1 + m + n. > > > > > > > > > > > > ARM Version: > > > > > > A variant of the Q notation has been in use by ARM in which the m > > > > > > number also counts the sign bit > > > > > > > > > > Yes, you've definitely got to know which one the hardware is using and > > > > > expecting. I wouldn't make a new trait for this - if we have to specify > > > > > we can wrap one in the other if it really helps. > > > > > > > > I'm not sure, if I'm working with the TI format (which as far as I > > > > understand is the most common?) then to have a signed value correctly > > > > represented as a Q<4,8> I would have to use Q<5,8> (which is > > > > counter-intuitive). > > > > > > > > I would rather modify the Trait to put the sign in the [m + n + 1] > > > > bit. > > > > > > > > Or are the registers you're working with in ARM format ? (sign in > > > > [m + n] position) > > > > > > That's (include the bit) what the original fixedToFloatingPoint() > > > implementations used, so that's what I've continued with. > > > > I see but that doesn't mean it's correct. > > > > I read one platform manual the description of a coefficient as > > > > "8:0 cc_coeff_0 Coefficient 0 for color space conversion" > > color conversion coefficients are signed integer values with a 7 bit > > fractional part; range: [-2…1.992] > > > > so if there are 7 fractional bit and the max achievable value is 1.992 > > it means that the value is in Q<1,7> format as: > > > > (1 << (1 + 7)) - 1 / (1 << 7) = 1.999 > > > > the register size is 9 bits (see the [8:0] in the register > > description) so I the sign bit is at location [8]. 
> > > > Am I wrong that I want to obtain this with your model I would have to > > describe the fixed point representation as Q<2,7> (which doesn't match > > the datasheet) ? > > Why doesn't this match the datasheet ? The text you quoted says 7 bits of > fractional value (match), 9 bits register field (8:0, matching 2+7), and > the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise). Ok, this datasheet doesn't specify the value for 'm' but do we agree that if m has to indicate the "integer" part, then it should be 1 and not 2 ? In the same datasheet we also have: "10:0 ct_coeff Values are 11-bit signed fixed-point numbers with 4 bit integer and 7 bit fractional part, ranging from -8 (0x400) to +7.992 (0x3FF)." In this case the value is suggested as Q<4,7> and the register is 11 bits wide, so bit[11] is the sign. Datasheets for other platforms clearly say that a signed Q<4,8> format is stored in 13 bits, so I would have to use Q<5,8> to have the sign bit in position [13], I guess. I feel like, given the wide variety of options, we should be able to control where the sign bit goes to accommodate different vendors, or even different register formats from the same vendor. > > > And I guess this really is the difference between UQ<m, n> and Q<m, n> > > > > usigned Q has no sign bit and the destination register is of size [m+n] > > signed Q has a sign bit in position [m+n+1] with the value in 2's > > complement format and destination register of size [m+n+1] > > In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1. > > > > If you want to distinguish these? How should we represent them? > > > > > > /* All 8 bit storage */ > > > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ? > > > > Let's start by deciding what behaviour we want by default maybe.. > > Let's pick one option and stick to it please. 
Yes, writing Q<4, 12> when > a TI datasheet says "Q3.12 value" may be a bit confusing, but it's I'm not sure this is limited to TI. I actually see datasheets from the author of the variant Q format that comply with the TI version of the Q format. So don't assume the "ARM format" is used on ARM platforms and the TI format on TI ones. > encoding in the type in one place and the rest of the code doesn't have > to think about it. > > We *could* define device-specific aliases in specific IPA modules if we > really wanted, but I wouldn't define multiple types in libipa. > > > > > > > I guess the only way to know which one is meant to be used is to > > > > > > actually look at the register sizes. If a Q<4,8> number is stored as > > > > > > a 13 bit fields, then the TI version is used. I wonder how common the > > > > > > ARM version is. > > > > > > > > > > > > > """ > > > > > > > * The sign of the value is determined by the sign of \a T. For signed types, > > > > > > > * the number of integer bits includes the sign bit. > > > > > > > """ > > > > > > > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > > > > > > > the first of the 32 bits? > > > > > > > > > > > > > > > > > + * > > > > > > > > > + * \code{.cpp} > > > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > > > > > > + * \endcode > > > > > > > > > + * > > > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > > > > > > + * converted as: > > > > > > > > > + * > > > > > > > > > + * \code{.cpp} > > > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > > > > > > + * \endcode > > > > > > > > > + * > > > > > > > > > * \return The converted value > > > > > > > > > */ > > > > > > > > > > > -- > Regards, > > Laurent Pinchart
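As a rough illustration of the "device-specific alias" idea quoted above, the sketch below maps a TI-style Qm.n name (sign bit not counted in m) onto a type that counts the sign bit in its integer width. Both Q and QTi here are stand-ins written for this note: Q only models the bit bookkeeping and is not the Quantized type from the series, and QTi is a hypothetical name loosely following the Q_TI spelling floated earlier in the thread.

```cpp
#include <cstdint>

// Stand-in for a signed fixed-point type where I includes the sign bit.
// Only width and range are modelled; this is not the series' Q type.
template<unsigned int I, unsigned int F>
struct Q {
	static_assert(I >= 1, "I counts the sign bit, so it must be at least 1");
	static_assert(I + F < 32, "field must fit in 31 bits");

	/* Total storage width of the field, in bits. */
	static constexpr unsigned int bits() { return I + F; }

	/* Most negative value: -2^(I-1). */
	static constexpr double min()
	{
		return -static_cast<double>(1u << (I + F - 1)) / (1u << F);
	}

	/* Most positive value: 2^(I-1) - 2^-F. */
	static constexpr double max()
	{
		return static_cast<double>((1u << (I + F - 1)) - 1) / (1u << F);
	}
};

// Hypothetical alias: a TI-style Qm.n uses 1 + m + n bits, so it maps to
// Q<M + 1, N> under the sign-included convention.
template<unsigned int M, unsigned int N>
using QTi = Q<M + 1, N>;
```

Under these assumptions QTi<4, 8> is the same type as Q<5, 8>: 13 bits wide, ranging from -16 to 15.99609375, while Q<4, 8> stays at 12 bits. An alias of this kind would live in the IPA module that needs it, so the datasheet spelling is translated once and the rest of the code sees a single convention.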
On Wed, Jan 21, 2026 at 05:54:35PM +0100, Jacopo Mondi wrote: > On Wed, Jan 21, 2026 at 06:37:55PM +0200, Laurent Pinchart wrote: > > On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote: > > > On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote: > > > > Quoting Jacopo Mondi (2026-01-21 15:12:24) > > > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote: > > > > > > Quoting Jacopo Mondi (2026-01-21 12:53:49) > > > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote: > > > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06) > > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49) > > > > > > > > > > Converting numbers with a signed fixed-point representation to > > > > > > > > > > the corresponding float value requires to include the sign bit in the > > > > > > > > > > width of the fixed-point integral part. > > > > > > > > > > > > > > > > > > > > Clearly specify it in documentation. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> > > > > > > > > > > --- > > > > > > > > > > src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++- > > > > > > > > > > 1 file changed, 21 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp > > > > > > > > > > index 6b698fc5d680..b37cdc43936f 100644 > > > > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp > > > > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp > > > > > > > > > > @@ -29,11 +29,31 @@ namespace ipa { > > > > > > > > > > /** > > > > > > > > > > * \fn R fixedToFloatingPoint(T number) > > > > > > > > > > * \brief Convert a fixed-point number to a floating point representation > > > > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point > > > > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the > > > > > > > > > > + * optional sign bit > > > > > > > > > > * \tparam F Bit 
width of the fractional part of the fixed-point > > > > > > > > > > * \tparam R Return type of the floating point representation > > > > > > > > > > * \tparam T Input type of the fixed-point representation > > > > > > > > > > * \param number The fixed point number to convert to floating point > > > > > > > > > > + * > > > > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included > > > > > > > > > > + * in the \a I template parameter that specifies the number of bits of the > > > > > > > > > > + * integral part of the fixed-point representation. > > > > > > > > > > + * > > > > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be > > > > > > > > > > + * converted to its corresponding floating point representation as: > > > > > > > > > > > > > > > > Just to be sure - you know I've got patches to remove all of the above > > > > > > > > that I want to get merged 'soon' right? > > > > > > > > > > > > > > Read the last bit of my reply from yesterday :) > > > > > > > > I still don't get this? > > > > > > I meant the discussion on sign/magnitude representation > > > > > > sign/magnitude is a different representation of signed integers > > > compared to the de-facto standard 2's complement. It requires to > > > manipulate the result of the float-to-fixed conversion so that we take > > > the absolute value and the sign bit is set in the [m + n + 1] bit > > > > > > ------------------------------------------------------------------------------ > > > As a bit of pseudo code > > > > > > int reg = static_cast<int>(std::round(number * (1 << F)))) & mask; > > > uint16_t res += std::abs(reg); > > > if (reg < 0) > > > res |= BIT(13); > > > > > > > > > I think this could be surely optimized and nicely made a Traits that > > > can be added to the Quantized series Kieran is working on. 
> > > ------------------------------------------------------------------------------ > > > > > > The above is the pharse I thought it could make you happy: > > > sign/magnitude fixed-point formats can be easily be represented with a > > > Trait on top of your series > > > > > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and > > > > > > > > UQ<4, 8> types. > > > > > > > > > > > > > > What is the difference between signed and unsigned ? Is it only the > > > > > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0] > > > > > > > > > > > > Please take a look through the tests I've added: > > > > > > > > > > > > https://patchwork.libcamera.org/patch/25801/ > > > > > > > > > > > > /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/ > > > > > > /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */ > > > > > > > > > > > > /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */ > > > > > > /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */ > > > > > > > > > > > > It's easy to extend that if you have specific Q types you want to > > > > > > use/test. > > > > > > > > > > Ah yes, for min/max it's defintely useful to have signed/unsigned > > > > > types > > > > > > > > It's not about min/max is useful - it's the very fact that Q and UQ have > > > > a distinct range. Q types can go less than zero but still span the same > > > > distance, so the top/max is halved, but the step size is the same. > > > > > > Yes, min/max and range indeed. > > > > > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'. > > > > > > > > I can add that explicitly to the documentation in my new series for v6. > > > > > > > > > > > > > > > > > > > > > Well, maybe we need two traits ? 
> > > > > > > https://en.wikipedia.org/wiki/Q_(number_format) > > > > > > > > > > > > > > Texas Instruments version: > > > > > > > The first bit always gives the sign of the value (1 = negative, 0 = > > > > > > > non-negative), and it is not counted in the m parameter. Thus, the > > > > > > > total number w of bits used is 1 + m + n. > > > > > > > > > > > > > > ARM Version: > > > > > > > A variant of the Q notation has been in use by ARM in which the m > > > > > > > number also counts the sign bit > > > > > > > > > > > > Yes, you've definitely got to know which one the hardware is using and > > > > > > expecting. I wouldn't make a new trait for this - if we have to specify > > > > > > we can wrap one in the other if it really helps. > > > > > > > > > > I'm not sure, if I'm working with the TI format (which as far as I > > > > > understand is the most common?) then to have a signed value correctly > > > > > represented as a Q<4,8> I would have to use Q<5,8> (which is > > > > > counter-intuitive). > > > > > > > > > > I would rather modify the Trait to put the sign in the [m + n + 1] > > > > > bit. > > > > > > > > > > Or are the registers you're working with in ARM format ? (sign in > > > > > [m + n] position) > > > > > > > > That's (include the bit) what the original fixedToFloatingPoint() > > > > implementations used, so that's what I've continued with. > > > > > > I see but that doesn't mean it's correct. 
> > > > > > I read one platform manual the description of a coefficient as > > > > > > "8:0 cc_coeff_0 Coefficient 0 for color space conversion" > > > color conversion coefficients are signed integer values with a 7 bit > > > fractional part; range: [-2…1.992] > > > > > > so if there are 7 fractional bit and the max achievable value is 1.992 > > > it means that the value is in Q<1,7> format as: > > > > > > (1 << (1 + 7)) - 1 / (1 << 7) = 1.999 > > > > > > the register size is 9 bits (see the [8:0] in the register > > > description) so I the sign bit is at location [8]. > > > > > > Am I wrong that I want to obtain this with your model I would have to > > > describe the fixed point representation as Q<2,7> (which doesn't match > > > the datasheet) ? > > > > Why doesn't this match the datasheet ? The text you quoted says 7 bits of > > fractional value (match), 9 bits register field (8:0, matching 2+7), and > > the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise). > > Ok, this datasheet doesn't specify the value for 'm' but do we agree > that if m has to indicate the "integer" part, then it should be 1 and > not 2 ? No :-) If you want a range from -2 to 1.992, the 'm' value given the convention in this series is 2. > In the same datasheet we also have: > > 10:0 ct_coeff > Values are 11-bit signed fixed-point numbers with 4 bit integer and 7 > bit fractional part, ranging from -8 (0x400) to +7.992 (0x3FF)." > > In this case the value is suggested as Q<4,7> and the register is of > 11 bits, so bit[11] is the sign. > > Datasheets for other platforms clearly say that a signed Q<4,8> format > is stored in 13 bits, so I should have to use Q<5,8> to have the sign > bit in position [13] I guess As discussed in this thread, there are multiple conventions. The convention taken in this series is that Q<4, 8> is stored in 12 bits. There's no single convention that will match all documentation ever written, so we should pick one and live with it. 
I vote for the convention in this series (a.k.a. the ARM convention). > I feel like, give the wide variety of option, we should be able to > control where the sign bit goes to accommodate different vendors, or > even different register formats from the same vendor. > > > > And I guess this really is the difference between UQ<m, n> and Q<m, n> > > > > > > usigned Q has no sign bit and the destination register is of size [m+n] > > > signed Q has a sign bit in position [m+n+1] with the value in 2's > > > complement format and destination register of size [m+n+1] > > > > In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1. > > > > > > If you want to distinguish these? How should we represent them? > > > > > > > > /* All 8 bit storage */ > > > > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ? > > > > > > Let's start by deciding what behaviour we want by default maybe.. > > > > Let's pick one option and stick to it please. Yes, writing Q<4, 12> when > > a TI datasheet says "Q3.12 value" may be a bit confusing, but it's > > I'm not sure this is limited by TI, I actually see datasheet from the > author of the variant Q format complying with the TI version of the Q > format.. So don't assume the "ARM format" is used on ARM platforms and > TI format on TI ones.. > > > encoding in the type in one place and the rest of the code doesn't have > > to think about it. > > > > We *could* define device-specific aliases in specific IPA modules if we > > really wanted, but I wouldn't define multiple types in libipa. > > > > > > > > > I guess the only way to know which one is meant to be used is to > > > > > > > actually look at the register sizes. If a Q<4,8> number is stored as > > > > > > > a 13 bit fields, then the TI version is used. I wonder how common the > > > > > > > ARM version is. > > > > > > > > > > > > > > > """ > > > > > > > > * The sign of the value is determined by the sign of \a T. For signed types, > > > > > > > > * the number of integer bits includes the sign bit. 
> > > > > > > > """ > > > > > > > > > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of > > > > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on > > > > > > > > > the first of the 32 bits? > > > > > > > > > > > > > > > > > > > + * > > > > > > > > > > + * \code{.cpp} > > > > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed); > > > > > > > > > > + * \endcode > > > > > > > > > > + * > > > > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be > > > > > > > > > > + * converted as: > > > > > > > > > > + * > > > > > > > > > > + * \code{.cpp} > > > > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed); > > > > > > > > > > + * \endcode > > > > > > > > > > + * > > > > > > > > > > * \return The converted value > > > > > > > > > > */ > > > > > > > > > >
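For reference, the width and range implied by the convention chosen in this series (sign bit counted in the integer width, two's complement storage) can be worked out as in the sketch below. FixedRange, signedRange and unsignedRange are names invented for this note and are not part of libipa or of the Quantized series.

```cpp
#include <cstdint>

// Hypothetical descriptor of a fixed-point format's storage and range.
struct FixedRange {
	unsigned int bits;	/* storage width of the field */
	double min;		/* smallest representable value */
	double max;		/* largest representable value */
	double step;		/* distance between adjacent values */
};

// Signed two's complement format, sign bit counted in I.
constexpr FixedRange signedRange(unsigned int I, unsigned int F)
{
	const double scale = static_cast<double>(1u << F);
	const double half = static_cast<double>(1u << (I + F - 1));

	return { I + F, -half / scale, (half - 1) / scale, 1 / scale };
}

// Unsigned format: all I + F bits carry magnitude.
constexpr FixedRange unsignedRange(unsigned int I, unsigned int F)
{
	const double scale = static_cast<double>(1u << F);
	const double full = static_cast<double>(1u << (I + F));

	return { I + F, 0.0, (full - 1) / scale, 1 / scale };
}
```

These helpers reproduce the figures from the test comments quoted in the thread: signedRange(1, 7) spans -1 to 0.9921875 and unsignedRange(1, 7) spans 0 to 1.9921875, both with a step of 0.0078125. They also give 12 bits for signedRange(4, 8), and -2 to 1.9921875 for signedRange(2, 7), matching the cc_coeff field discussed above.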
Hi Laurent

On Wed, Jan 21, 2026 at 08:00:08PM +0200, Laurent Pinchart wrote:
> On Wed, Jan 21, 2026 at 05:54:35PM +0100, Jacopo Mondi wrote:
> > On Wed, Jan 21, 2026 at 06:37:55PM +0200, Laurent Pinchart wrote:
> > > On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote:
> > > > On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote:
> > > > > Quoting Jacopo Mondi (2026-01-21 15:12:24)
> > > > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> > > > > > > Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > > > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > > > > > > width of the fixed-point integral part.
> > > > > > > > > > >
> > > > > > > > > > > Clearly specify it in documentation.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > > > > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > > > > > > >  /**
> > > > > > > > > > >  * \fn R fixedToFloatingPoint(T number)
> > > > > > > > > > >  * \brief Convert a fixed-point number to a floating point representation
> > > > > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > > > > > > + * optional sign bit
> > > > > > > > > > >  * \tparam F Bit width of the fractional part of the fixed-point
> > > > > > > > > > >  * \tparam R Return type of the floating point representation
> > > > > > > > > > >  * \tparam T Input type of the fixed-point representation
> > > > > > > > > > >  * \param number The fixed point number to convert to floating point
> > > > > > > > > > > + *
> > > > > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > > > > > > + * integral part of the fixed-point representation.
> > > > > > > > > > > + *
> > > > > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > > > > > > + * converted to its corresponding floating point representation as:
> > > > > > > > > >
> > > > > > > > > Just to be sure - you know I've got patches to remove all of the above
> > > > > > > > > that I want to get merged 'soon' right?
> > > > > > > >
> > > > > > > > Read the last bit of my reply from yesterday :)
> > > > >
> > > > > I still don't get this?
> > > >
> > > > I meant the discussion on sign/magnitude representation
> > > >
> > > > sign/magnitude is a different representation of signed integers
> > > > compared to the de-facto standard 2's complement. It requires to
> > > > manipulate the result of the float-to-fixed conversion so that we take
> > > > the absolute value and the sign bit is set in the [m + n + 1] bit
> > > >
> > > > ------------------------------------------------------------------------------
> > > > As a bit of pseudo code
> > > >
> > > > int reg = static_cast<int>(std::round(number * (1 << F)))) & mask;
> > > > uint16_t res += std::abs(reg);
> > > > if (reg < 0)
> > > >         res |= BIT(13);
> > > >
> > > >
> > > > I think this could be surely optimized and nicely made a Traits that
> > > > can be added to the Quantized series Kieran is working on.
> > > > ------------------------------------------------------------------------------
> > > >
> > > > The above is the pharse I thought it could make you happy:
> > > > sign/magnitude fixed-point formats can be easily be represented with a
> > > > Trait on top of your series
> > > >
> > > > > > > >
> > > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > > > > > > UQ<4, 8> types.
> > > > > > > >
> > > > > > > > What is the difference between signed and unsigned ? Is it only the
> > > > > > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
> > > > > > >
> > > > > > > Please take a look through the tests I've added:
> > > > > > >
> > > > > > > https://patchwork.libcamera.org/patch/25801/
> > > > > > >
> > > > > > > /* Q1.7(-1 .. 0.992188) Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> > > > > > > /* UQ1.7(0 .. 1.99219) Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
> > > > > > >
> > > > > > > /* Q12.4(-2048 .. 2047.94) Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> > > > > > > /* UQ12.4(0 .. 4095.94) Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
> > > > > > >
> > > > > > > It's easy to extend that if you have specific Q types you want to
> > > > > > > use/test.
> > > > > >
> > > > > > Ah yes, for min/max it's defintely useful to have signed/unsigned
> > > > > > types
> > > > >
> > > > > It's not about min/max is useful - it's the very fact that Q and UQ have
> > > > > a distinct range. Q types can go less than zero but still span the same
> > > > > distance, so the top/max is halved, but the step size is the same.
> > > >
> > > > Yes, min/max and range indeed.
> > > >
> > > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > > > > > > > I can add that explicitly to the documentation in my new series for v6.
> > > > > > > >
> > > > > > > >
> > > > > > > > Well, maybe we need two traits ?
> > > > > > > > https://en.wikipedia.org/wiki/Q_(number_format)
> > > > > > > >
> > > > > > > > Texas Instruments version:
> > > > > > > > The first bit always gives the sign of the value (1 = negative, 0 =
> > > > > > > > non-negative), and it is not counted in the m parameter. Thus, the
> > > > > > > > total number w of bits used is 1 + m + n.
> > > > > > > >
> > > > > > > > ARM Version:
> > > > > > > > A variant of the Q notation has been in use by ARM in which the m
> > > > > > > > number also counts the sign bit
> > > > > > >
> > > > > > > Yes, you've definitely got to know which one the hardware is using and
> > > > > > > expecting. I wouldn't make a new trait for this - if we have to specify
> > > > > > > we can wrap one in the other if it really helps.
> > > > > >
> > > > > > I'm not sure, if I'm working with the TI format (which as far as I
> > > > > > understand is the most common?) then to have a signed value correctly
> > > > > > represented as a Q<4,8> I would have to use Q<5,8> (which is
> > > > > > counter-intuitive).
> > > > > >
> > > > > > I would rather modify the Trait to put the sign in the [m + n + 1]
> > > > > > bit.
> > > > > >
> > > > > > Or are the registers you're working with in ARM format ? (sign in
> > > > > > [m + n] position)
> > > > >
> > > > > That's (include the bit) what the original fixedToFloatingPoint()
> > > > > implementations used, so that's what I've continued with.
> > > >
> > > > I see but that doesn't mean it's correct.
> > > >
> > > > I read one platform manual the description of a coefficient as
> > > >
> > > > "8:0 cc_coeff_0 Coefficient 0 for color space conversion"
> > > > color conversion coefficients are signed integer values with a 7 bit
> > > > fractional part; range: [-2…1.992]
> > > >
> > > > so if there are 7 fractional bit and the max achievable value is 1.992
> > > > it means that the value is in Q<1,7> format as:
> > > >
> > > > (1 << (1 + 7)) - 1 / (1 << 7) = 1.999
> > > >
> > > > the register size is 9 bits (see the [8:0] in the register
> > > > description) so I the sign bit is at location [8].
> > > >
> > > > Am I wrong that I want to obtain this with your model I would have to
> > > > describe the fixed point representation as Q<2,7> (which doesn't match
> > > > the datasheet) ?
> > >
> > > Why doesn't this match the datasheet ? The text you quoted says 7 bits of
> > > fractional value (match), 9 bits register field (8:0, matching 2+7), and
> > > the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise).
> >
> > Ok, this datasheet doesn't specify the value for 'm' but do we agree
> > that if m has to indicate the "integer" part, then it should be 1 and
> > not 2 ?
>
> No :-) If you want a range from -2 to 1.992, the 'm' value given the
> convention in this series is 2.

If you count the sign bit, yes

>
> > In the same datasheet we also have:
> >
> > 10:0 ct_coeff
> > Values are 11-bit signed fixed-point numbers with 4 bit integer and 7
> > bit fractional part, ranging from -8 (0x400) to +7.992 (0x3FF)."
> >
> > In this case the value is suggested as Q<4,7> and the register is of
> > 11 bits, so bit[11] is the sign.
> >
> > Datasheets for other platforms clearly say that a signed Q<4,8> format
> > is stored in 13 bits, so I should have to use Q<5,8> to have the sign
> > bit in position [13] I guess
>
> As discussed in this thread, there are multiple conventions. The
> convention taken in this series is that Q<4, 8> is stored in 12 bits.
> There's no single convention that will match all documentation ever
> written, so we should pick one an live with it. I vote for the
> convention in this series (a.k.a. the ARM convention).
>

Ok, I would have found the TI one more intuitive though
As long as it is documented clearly, I'll live with that

Thanks
  j

> > I feel like, give the wide variety of option, we should be able to
> > control where the sign bit goes to accommodate different vendors, or
> > even different register formats from the same vendor.
> >
> > > > And I guess this really is the difference between UQ<m, n> and Q<m, n>
> > > >
> > > > usigned Q has no sign bit and the destination register is of size [m+n]
> > > > signed Q has a sign bit in position [m+n+1] with the value in 2's
> > > > complement format and destination register of size [m+n+1]
> > >
> > > In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1.
> > >
> > > > > If you want to distinguish these? How should we represent them?
> > > > >
> > > > > /* All 8 bit storage */
> > > > > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?
> > > >
> > > > Let's start by deciding what behaviour we want by default maybe..
> > >
> > > Let's pick one option and stick to it please. Yes, writing Q<4, 12> when
> > > a TI datasheet says "Q3.12 value" may be a bit confusing, but it's
> >
> > I'm not sure this is limited by TI, I actually see datasheet from the
> > author of the variant Q format complying with the TI version of the Q
> > format.. So don't assume the "ARM format" is used on ARM platforms and
> > TI format on TI ones..
> >
> > > encoding in the type in one place and the rest of the code doesn't have
> > > to think about it.
> > >
> > > We *could* define device-specific aliases in specific IPA modules if we
> > > really wanted, but I wouldn't define multiple types in libipa.
> > >
> > > > > > > > I guess the only way to know which one is meant to be used is to
> > > > > > > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > > > > > > a 13 bit fields, then the TI version is used. I wonder how common the
> > > > > > > > ARM version is.
> > > > > > > >
> > > > > > > > > """
> > > > > > > > > * The sign of the value is determined by the sign of \a T. For signed types,
> > > > > > > > > * the number of integer bits includes the sign bit.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > > > > > > the first of the 32 bits?
> > > > > > > > > >
> > > > > > > > > > > + *
> > > > > > > > > > > + * \code{.cpp}
> > > > > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > > > > > > + * \endcode
> > > > > > > > > > > + *
> > > > > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > > > > > > + * converted as:
> > > > > > > > > > > + *
> > > > > > > > > > > + * \code{.cpp}
> > > > > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > > > > > > + * \endcode
> > > > > > > > > > > + *
> > > > > > > > > > >  * \return The converted value
> > > > > > > > > > >  */
> > > > > > > > > > >
>
> --
> Regards,
>
> Laurent Pinchart
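For readers following the two notations debated above, the difference can be sketched in a few lines. This is only an illustration under the definitions quoted in the thread (ARM: `m` counts the sign bit, TI: the sign bit is extra). `QLayout`, `armLayout()` and `tiLayout()` are hypothetical names introduced here and are not part of libipa or of Kieran's Quantized series.

```cpp
#include <cstdint>

// Hypothetical helper type (not from libipa): describes a signed,
// two's complement fixed-point register field.
struct QLayout {
    unsigned int storageBits; // total width of the field, sign included
    double min;               // most negative representable value
    double max;               // most positive representable value
    double step;              // weight of one least significant bit
};

// "ARM" convention, the one retained in the thread: m already counts the
// sign bit, so a signed Q<m, n> occupies m + n bits.
inline QLayout armLayout(unsigned int m, unsigned int n)
{
    const unsigned int bits = m + n;
    const double scale = static_cast<double>(1u << n);

    QLayout layout;
    layout.storageBits = bits;
    layout.min = -static_cast<double>(1u << (bits - 1)) / scale;
    layout.max = static_cast<double>((1u << (bits - 1)) - 1) / scale;
    layout.step = 1.0 / scale;
    return layout;
}

// "TI" convention: the sign bit is not counted in m, so the field is
// 1 + m + n bits wide. It is the ARM layout with one more integer bit.
inline QLayout tiLayout(unsigned int m, unsigned int n)
{
    return armLayout(m + 1, n);
}
```

Under these assumptions `armLayout(2, 7)` describes the 9-bit `cc_coeff_0` field quoted above (range -2 to 1.9921875), `armLayout(4, 8)` reports 12 storage bits, and `tiLayout(4, 8)` reports 13, which is the same layout as `armLayout(5, 8)`.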
diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
index 6b698fc5d680..b37cdc43936f 100644
--- a/src/ipa/libipa/fixedpoint.cpp
+++ b/src/ipa/libipa/fixedpoint.cpp
@@ -29,11 +29,31 @@ namespace ipa {
 /**
  * \fn R fixedToFloatingPoint(T number)
  * \brief Convert a fixed-point number to a floating point representation
- * \tparam I Bit width of the integer part of the fixed-point
+ * \tparam I Bit width of the integer part of the fixed-point including the
+ * optional sign bit
  * \tparam F Bit width of the fractional part of the fixed-point
  * \tparam R Return type of the floating point representation
  * \tparam T Input type of the fixed-point representation
  * \param number The fixed point number to convert to floating point
+ *
+ * If the fixed-point representation is signed, the sign bit shall be included
+ * in the \a I template parameter that specifies the number of bits of the
+ * integral part of the fixed-point representation.
+ *
+ * As an example, a value represented as signed fixed-point Q4.8 format can be
+ * converted to its corresponding floating point representation as:
+ *
+ * \code{.cpp}
+ * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
+ * \endcode
+ *
+ * While a value represented as unsigned fixed-point Q4.8 format can be
+ * converted as:
+ *
+ * \code{.cpp}
+ * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
+ * \endcode
+ *
  * \return The converted value
  */
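The reason the sign bit has to be counted in `I` can be seen from a stand-alone conversion routine: the sign is taken from bit `I + F - 1` of the raw field, so a 13-bit signed value needs `I = 5` with `F = 8`. The sketch below is an assumption-laden illustration and not the libipa implementation. `signedFixedToDouble()` is a hypothetical name, and it assumes a two's complement field of `I + F` bits held in a `uint32_t`.

```cpp
#include <cstdint>

// Hypothetical stand-alone sketch, not the libipa fixedToFloatingPoint().
// I counts the sign bit: the raw field is I + F bits wide and encoded in
// two's complement.
template<unsigned int I, unsigned int F>
double signedFixedToDouble(uint32_t raw)
{
    static_assert(I >= 1, "a signed format needs at least the sign bit");
    static_assert(I + F < 32, "the field must fit in 31 bits");

    constexpr uint32_t width = I + F;
    constexpr uint32_t mask = (1u << width) - 1;
    constexpr uint32_t signBit = 1u << (width - 1);

    // Discard any bits above the field.
    raw &= mask;

    // Sign-extend from bit (width - 1): flipping the sign bit and then
    // subtracting its weight maps the upper half of the range to
    // negative values.
    const int32_t value = static_cast<int32_t>(raw ^ signBit) -
                          static_cast<int32_t>(signBit);

    return static_cast<double>(value) / (1u << F);
}
```

With this sketch `signedFixedToDouble<5, 8>(0x1000)` yields -16.0 and `signedFixedToDouble<5, 8>(0x0fff)` yields 15.99609375, while instantiating it as `<4, 8>` would treat bit 11 as the sign and ignore bit 12.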
Converting numbers with a signed fixed-point representation to
the corresponding float value requires to include the sign bit in the
width of the fixed-point integral part.

Clearly specify it in documentation.

Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
---
 src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

--
2.52.0
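The sign/magnitude pseudo code quoted in the discussion can be turned into something compilable along the following lines. This is a minimal sketch under stated assumptions, not a proposed Trait: `floatToSignMagnitude()` is a hypothetical name, the sign is placed in the zero-based bit `M + F` (bit 12 for `M = 4`, `F = 8`, whereas the quoted pseudo code wrote `BIT(13)`), and out-of-range magnitudes are clamped, which the thread does not specify.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: encode a floating point value as sign/magnitude
// fixed-point with M integer bits, F fractional bits and a separate sign
// bit stored right above the magnitude.
template<unsigned int M, unsigned int F>
uint32_t floatToSignMagnitude(double number)
{
    static_assert(M + F < 31, "magnitude and sign must fit in 32 bits");

    constexpr uint32_t magnitudeMask = (1u << (M + F)) - 1;
    constexpr uint32_t signBit = 1u << (M + F);

    // Scale by 2^F and round to the nearest integer.
    const long scaled = std::lround(number * (1u << F));

    // Store the absolute value, clamped to the available bits
    // (clamping is an assumption made for this sketch).
    uint32_t magnitude = static_cast<uint32_t>(std::labs(scaled));
    if (magnitude > magnitudeMask)
        magnitude = magnitudeMask;

    // Negative inputs only differ by the sign bit.
    return scaled < 0 ? (magnitude | signBit) : magnitude;
}
```

For example, `floatToSignMagnitude<4, 8>(1.5)` gives 0x180 and `floatToSignMagnitude<4, 8>(-1.5)` gives 0x1180: the magnitude bits are identical and only the sign bit changes, unlike the two's complement encoding used elsewhere in the thread.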