[libcamera-devel,1/6] utils: Add function to convert string to UCS-2
diff mbox series

Message ID 20210114104035.302968-2-paul.elder@ideasonboard.com
State Superseded
Delegated to: Paul Elder
Headers show
Series
  • Fill in android result metadata and EXIF tags
Related show

Commit Message

Paul Elder Jan. 14, 2021, 10:40 a.m. UTC
GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but
are recommended to be in UCS-2. Add a function in utils to help with
this.

Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>
---
 include/libcamera/internal/utils.h |  2 ++
 src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

Comments

Jacopo Mondi Jan. 15, 2021, 2:19 p.m. UTC | #1
Hi Paul,

  I read a few things around, but character encoding seems a very
  complex subject, so I mostly have minor comments here

On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but

From what I've read, even referring to Unicode might be mis-leading as
it includes a number of different encodings. Do the EXIF specification
mention Unicode or any other more specific standard ?

> are recommended to be in UCS-2. Add a function in utils to help with
> this.
>
> Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>
> ---
>  include/libcamera/internal/utils.h |  2 ++
>  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> index f08134af..aa9cc236 100644
> --- a/include/libcamera/internal/utils.h
> +++ b/include/libcamera/internal/utils.h
> @@ -35,6 +35,8 @@ const char *basename(const char *path);
>  char *secure_getenv(const char *name);
>  std::string dirname(const std::string &path);
>
> +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> +
>  template<typename T>
>  std::vector<typename T::key_type> map_keys(const T &map)
>  {
> diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> index e90375ae..89cb0f73 100644
> --- a/src/libcamera/utils.cpp
> +++ b/src/libcamera/utils.cpp
> @@ -17,6 +17,7 @@
>  #include <string.h>
>  #include <sys/stat.h>
>  #include <sys/types.h>
> +#include <uchar.h>
>  #include <unistd.h>
>
>  /**
> @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
>  	return path.substr(0, pos + 1);
>  }
>
> +/**
> + * \brief Convert string to byte array of UCS-2

a string to a byte array of UCS-2 encoded code point

But I wonder, the encoding used to represent the characters in the
string I assume depends on some locale, do they ?

> + * \param[in] str String to convert

The string to convert

> + * \param[in] le Little-endian (false for Big-endian)

The desired byte-endianess of the converted byte array.

An enum would not hurt, but it's not strictly required.

> + *
> + * \return Byte array of UCS-2 representation of \a str, without null-terminator

While it is still not clear to me the distinction between UTF-16 and
UCS-2 and I get the two are actually converging over time, the
documentaion of std::mbrtoc16 explicitely mentions UTF-16.

I guess it again depends on the encoding of \a str (which again
depends on the selected locale ?)

> + */
> +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)

I wonder why we use snake_case in utils ? maybe to mimic STL ?

> +{
> +	std::mbstate_t state{};
> +	char16_t c16;
> +	const char *ptr = &str[0], *end = &str[0] + str.size();

One variable per line and maybe
        const char *end = &str.back()
> +
> +	std::vector<uint8_t> ret;

I would reserve str.size() * 2

Even if I get it's not necessarly that every char in str gets expanded
to two bytes

> +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {

std::mbrtoc16 ?
How come the compiler does not complain ?

> +		if (rc == static_cast<size_t>(-2) ||
> +		    rc == static_cast<size_t>(-1))
> +			break;
> +
> +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));

I think you can avoid & 0xff as being ret an array of uint8_t c16 gets
automatically converted, does it ?

> +
> +		if (rc > 0)
> +			ptr += rc;
> +	}
> +
> +	return ret;
> +}
> +
>  /**
>   * \fn std::vector<typename T::key_type> map_keys(const T &map)
>   * \brief Retrieve the keys of a std::map<>
> --
> 2.27.0
>
> _______________________________________________
> libcamera-devel mailing list
> libcamera-devel@lists.libcamera.org
> https://lists.libcamera.org/listinfo/libcamera-devel
Laurent Pinchart Jan. 15, 2021, 5:01 p.m. UTC | #2
Hi Paul,

Thank you for the patch.

On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:
> Hi Paul,
> 
>   I read a few things around, but character encoding seems a very
>   complex subject, so I mostly have minor comments here
> 
> On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but
> 
> From what I've read, even referring to Unicode might be mis-leading as
> it includes a number of different encodings. Do the EXIF specification
> mention Unicode or any other more specific standard ?

The Exif specification is fairly bad in that regard. It only says
"unicode", while what it means (or at least what is used in practice
today, I don't know if that's what was meant originally) is UCS-2. The
endianness is also not mentioned, and we believe it should match the
endianness of the TIFF container, but that's not documented.


> > are recommended to be in UCS-2. Add a function in utils to help with
> > this.

Unicode is a standard that contains (among other things) a set of
characters with their respective code points (a number that identifies
the character) and a set of character encodings. Encodings define how
to encode a code point into bytes. UCS-2 is one particular encoding.
Saying "can be in Unicode but are recommended to be in UCS-2" is thus
not quite correct.

> > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>
> > ---
> >  include/libcamera/internal/utils.h |  2 ++
> >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > index f08134af..aa9cc236 100644
> > --- a/include/libcamera/internal/utils.h
> > +++ b/include/libcamera/internal/utils.h
> > @@ -35,6 +35,8 @@ const char *basename(const char *path);
> >  char *secure_getenv(const char *name);
> >  std::string dirname(const std::string &path);
> >
> > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> > +

I'd move this to the EXIF class in the HAL. There's hardly any chance
that libcamera will need to convert a string to UCS-2.

> >  template<typename T>
> >  std::vector<typename T::key_type> map_keys(const T &map)
> >  {
> > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > index e90375ae..89cb0f73 100644
> > --- a/src/libcamera/utils.cpp
> > +++ b/src/libcamera/utils.cpp
> > @@ -17,6 +17,7 @@
> >  #include <string.h>
> >  #include <sys/stat.h>
> >  #include <sys/types.h>
> > +#include <uchar.h>
> >  #include <unistd.h>
> >
> >  /**
> > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
> >  	return path.substr(0, pos + 1);
> >  }
> >
> > +/**
> > + * \brief Convert string to byte array of UCS-2
> 
> a string to a byte array of UCS-2 encoded code point
> 
> But I wonder, the encoding used to represent the characters in the
> string I assume depends on some locale, do they ?

std::string has no concept of encoding. It's only an array of char. This
function assumes that the input is encoded in UTF-8, and outputs a
UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string
doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text
is valid UTF-16 text.

It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe
to use UTF-16.

> > + * \param[in] str String to convert
> 
> The string to convert
> 
> > + * \param[in] le Little-endian (false for Big-endian)
> 
> The desired byte-endianess of the converted byte array.
> 
> An enum would not hurt, but it's not strictly required.
> 
> > + *
> > + * \return Byte array of UCS-2 representation of \a str, without null-terminator
> 
> While it is still not clear to me the distinction between UTF-16 and
> UCS-2 and I get the two are actually converging over time, the
> documentaion of std::mbrtoc16 explicitely mentions UTF-16.
> 
> I guess it again depends on the encoding of \a str (which again
> depends on the selected locale ?)
> 
> > + */
> > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)
> 
> I wonder why we use snake_case in utils ? maybe to mimic STL ?

For some functions, we reuse names defined elsewhere (for instance our
custom implementation of secure_getenv() for platforms that don't
provide it). For other functions, it's to mimick STL indeed (for
instance set_overlap, there's a std::set_union() and
std::set_intersection()). Then we have duration_to_timespec() and
time_point_to_string() that should probably be in camelCase.

For this function, I'd use cameraCase. The function should likely be
named utf8ToUtf16 or something similar, to make it clear what the source
encoding is.

If we want to keep this function a generator UTF-8 to UTF-16 convertor,
I'd return a std::u16string instead of a std::vector<uint8_t>. This
would mean handling the endianness conversion in the caller. The
alternative is to turn this into a UTF-8 to Exif-Unicode function, in
which case you can keep the le argument, but I would then rename it to
endianness and use the Exif endianness macros as values instead of
making it a bool.

> > +{
> > +	std::mbstate_t state{};
> > +	char16_t c16;
> > +	const char *ptr = &str[0], *end = &str[0] + str.size();
> 
> One variable per line and maybe
>         const char *end = &str.back()

back() isn't the same.

	const char *ptr = str.data();
	const char *end = ptr + str.size();

> > +
> > +	std::vector<uint8_t> ret;
> 
> I would reserve str.size() * 2
> 
> Even if I get it's not necessarly that every char in str gets expanded
> to two bytes
> 
> > +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {
> 
> std::mbrtoc16 ?
> How come the compiler does not complain ?

With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it
should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,
mbrtoc16() is fine, but the state variable above should be changed.

The C and C++ libraries are allowed to, but not required to, define
functions and types in both the std namespace and the unqualified
namespace. It's unsafe to rely on this though.

> > +		if (rc == static_cast<size_t>(-2) ||
> > +		    rc == static_cast<size_t>(-1))
> > +			break;
> > +
> > +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> > +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));
> 
> I think you can avoid & 0xff as being ret an array of uint8_t c16 gets
> automatically converted, does it ?
> 
> > +
> > +		if (rc > 0)
> > +			ptr += rc;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * \fn std::vector<typename T::key_type> map_keys(const T &map)
> >   * \brief Retrieve the keys of a std::map<>
Jacopo Mondi Jan. 18, 2021, 9:17 a.m. UTC | #3
Hi Laurent,

On Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote:
> Hi Paul,
>
> Thank you for the patch.
>
> On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:
> > Hi Paul,
> >
> >   I read a few things around, but character encoding seems a very
> >   complex subject, so I mostly have minor comments here
> >
> > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but
> >
> > From what I've read, even referring to Unicode might be mis-leading as
> > it includes a number of different encodings. Do the EXIF specification
> > mention Unicode or any other more specific standard ?
>
> The Exif specification is fairly bad in that regard. It only says
> "unicode", while what it means (or at least what is used in practice
> today, I don't know if that's what was meant originally) is UCS-2. The
> endianness is also not mentioned, and we believe it should match the
> endianness of the TIFF container, but that's not documented.
>
>
> > > are recommended to be in UCS-2. Add a function in utils to help with
> > > this.
>
> Unicode is a standard that contains (among other things) a set of
> characters with their respective code points (a number that identifies
> the character) and a set of character encodings. Encodings define how
> to encode a code point into bytes. UCS-2 is one particular encoding.
> Saying "can be in Unicode but are recommended to be in UCS-2" is thus
> not quite correct.
>
> > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>
> > > ---
> > >  include/libcamera/internal/utils.h |  2 ++
> > >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
> > >  2 files changed, 32 insertions(+)
> > >
> > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > > index f08134af..aa9cc236 100644
> > > --- a/include/libcamera/internal/utils.h
> > > +++ b/include/libcamera/internal/utils.h
> > > @@ -35,6 +35,8 @@ const char *basename(const char *path);
> > >  char *secure_getenv(const char *name);
> > >  std::string dirname(const std::string &path);
> > >
> > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> > > +
>
> I'd move this to the EXIF class in the HAL. There's hardly any chance
> that libcamera will need to convert a string to UCS-2.
>
> > >  template<typename T>
> > >  std::vector<typename T::key_type> map_keys(const T &map)
> > >  {
> > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > > index e90375ae..89cb0f73 100644
> > > --- a/src/libcamera/utils.cpp
> > > +++ b/src/libcamera/utils.cpp
> > > @@ -17,6 +17,7 @@
> > >  #include <string.h>
> > >  #include <sys/stat.h>
> > >  #include <sys/types.h>
> > > +#include <uchar.h>
> > >  #include <unistd.h>
> > >
> > >  /**
> > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
> > >  	return path.substr(0, pos + 1);
> > >  }
> > >
> > > +/**
> > > + * \brief Convert string to byte array of UCS-2
> >
> > a string to a byte array of UCS-2 encoded code point
> >
> > But I wonder, the encoding used to represent the characters in the
> > string I assume depends on some locale, do they ?
>
> std::string has no concept of encoding. It's only an array of char. This
> function assumes that the input is encoded in UTF-8, and outputs a
> UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string

Not sure I get the "concept of encoding". Each char is an 8-bit value
which represents a code point. Different encodings might associate the code
point to a different symbol.

Just after the first statement you said "UTF-8" is assumed, so there's
some assumed encoding.

Anyway, if the whole thing is put as "Expand and UTF-8 string to UTF-16"
it sound a bit more clear.

> doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text
> is valid UTF-16 text.
>
> It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe
> to use UTF-16.
>
> > > + * \param[in] str String to convert
> >
> > The string to convert
> >
> > > + * \param[in] le Little-endian (false for Big-endian)
> >
> > The desired byte-endianess of the converted byte array.
> >
> > An enum would not hurt, but it's not strictly required.
> >
> > > + *
> > > + * \return Byte array of UCS-2 representation of \a str, without null-terminator
> >
> > While it is still not clear to me the distinction between UTF-16 and
> > UCS-2 and I get the two are actually converging over time, the
> > documentaion of std::mbrtoc16 explicitely mentions UTF-16.
> >
> > I guess it again depends on the encoding of \a str (which again
> > depends on the selected locale ?)
> >
> > > + */
> > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)
> >
> > I wonder why we use snake_case in utils ? maybe to mimic STL ?
>
> For some functions, we reuse names defined elsewhere (for instance our
> custom implementation of secure_getenv() for platforms that don't
> provide it). For other functions, it's to mimick STL indeed (for
> instance set_overlap, there's a std::set_union() and
> std::set_intersection()). Then we have duration_to_timespec() and
> time_point_to_string() that should probably be in camelCase.
>
> For this function, I'd use cameraCase. The function should likely be
> named utf8ToUtf16 or something similar, to make it clear what the source
> encoding is.
>
> If we want to keep this function a generator UTF-8 to UTF-16 convertor,
> I'd return a std::u16string instead of a std::vector<uint8_t>. This
> would mean handling the endianness conversion in the caller. The
> alternative is to turn this into a UTF-8 to Exif-Unicode function, in
> which case you can keep the le argument, but I would then rename it to
> endianness and use the Exif endianness macros as values instead of
> making it a bool.
>
> > > +{
> > > +	std::mbstate_t state{};
> > > +	char16_t c16;
> > > +	const char *ptr = &str[0], *end = &str[0] + str.size();
> >
> > One variable per line and maybe
> >         const char *end = &str.back()
>
> back() isn't the same.
>

Ups
std::string::back() : equivalent to operator[](size() - 1)

> 	const char *ptr = str.data();
> 	const char *end = ptr + str.size();
>
> > > +
> > > +	std::vector<uint8_t> ret;
> >
> > I would reserve str.size() * 2
> >
> > Even if I get it's not necessarly that every char in str gets expanded
> > to two bytes
> >
> > > +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {
> >
> > std::mbrtoc16 ?
> > How come the compiler does not complain ?
>
> With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it
> should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,
> mbrtoc16() is fine, but the state variable above should be changed.

Uh, I assumed the C++ header having seen std::mbstate_t

>
> The C and C++ libraries are allowed to, but not required to, define
> functions and types in both the std namespace and the unqualified
> namespace. It's unsafe to rely on this though.
>
> > > +		if (rc == static_cast<size_t>(-2) ||
> > > +		    rc == static_cast<size_t>(-1))
> > > +			break;
> > > +
> > > +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> > > +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));
> >
> > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets
> > automatically converted, does it ?
> >
> > > +
> > > +		if (rc > 0)
> > > +			ptr += rc;
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  /**
> > >   * \fn std::vector<typename T::key_type> map_keys(const T &map)
> > >   * \brief Retrieve the keys of a std::map<>
>
> --
> Regards,
>
> Laurent Pinchart
Laurent Pinchart Jan. 18, 2021, 9:26 a.m. UTC | #4
Hi Jacopo,

On Mon, Jan 18, 2021 at 10:17:11AM +0100, Jacopo Mondi wrote:
> On Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote:
> > On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:
> > > Hi Paul,
> > >
> > >   I read a few things around, but character encoding seems a very
> > >   complex subject, so I mostly have minor comments here
> > >
> > > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:
> > > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but
> > >
> > > From what I've read, even referring to Unicode might be mis-leading as
> > > it includes a number of different encodings. Do the EXIF specification
> > > mention Unicode or any other more specific standard ?
> >
> > The Exif specification is fairly bad in that regard. It only says
> > "unicode", while what it means (or at least what is used in practice
> > today, I don't know if that's what was meant originally) is UCS-2. The
> > endianness is also not mentioned, and we believe it should match the
> > endianness of the TIFF container, but that's not documented.
> >
> > > > are recommended to be in UCS-2. Add a function in utils to help with
> > > > this.
> >
> > Unicode is a standard that contains (among other things) a set of
> > characters with their respective code points (a number that identifies
> > the character) and a set of character encodings. Encodings define how
> > to encode a code point into bytes. UCS-2 is one particular encoding.
> > Saying "can be in Unicode but are recommended to be in UCS-2" is thus
> > not quite correct.
> >
> > > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>
> > > > ---
> > > >  include/libcamera/internal/utils.h |  2 ++
> > > >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++
> > > >  2 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > > > index f08134af..aa9cc236 100644
> > > > --- a/include/libcamera/internal/utils.h
> > > > +++ b/include/libcamera/internal/utils.h
> > > > @@ -35,6 +35,8 @@ const char *basename(const char *path);
> > > >  char *secure_getenv(const char *name);
> > > >  std::string dirname(const std::string &path);
> > > >
> > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
> > > > +
> >
> > I'd move this to the EXIF class in the HAL. There's hardly any chance
> > that libcamera will need to convert a string to UCS-2.
> >
> > > >  template<typename T>
> > > >  std::vector<typename T::key_type> map_keys(const T &map)
> > > >  {
> > > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > > > index e90375ae..89cb0f73 100644
> > > > --- a/src/libcamera/utils.cpp
> > > > +++ b/src/libcamera/utils.cpp
> > > > @@ -17,6 +17,7 @@
> > > >  #include <string.h>
> > > >  #include <sys/stat.h>
> > > >  #include <sys/types.h>
> > > > +#include <uchar.h>
> > > >  #include <unistd.h>
> > > >
> > > >  /**
> > > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)
> > > >  	return path.substr(0, pos + 1);
> > > >  }
> > > >
> > > > +/**
> > > > + * \brief Convert string to byte array of UCS-2
> > >
> > > a string to a byte array of UCS-2 encoded code point
> > >
> > > But I wonder, the encoding used to represent the characters in the
> > > string I assume depends on some locale, do they ?
> >
> > std::string has no concept of encoding. It's only an array of char. This
> > function assumes that the input is encoded in UTF-8, and outputs a
> > UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string
> 
> Not sure I get the "concept of encoding". Each char is an 8-bit value
> which represents a code point. Different encodings might associate the code
> point to a different symbol.

In unicode a code point is a 32-bit value. The encoding then defines how
the code point is encoded in bytes (in a variable number of bytes for
UTF-8, a single byte that can only represent a subset of unicode for
ISO-8859-*, a variable number of u16 for UTF-16, a single u16 that can
only represent a subdev for unicode for UCS-2, ...). An std::string
doesn't specify any encoding. As it stores a string as a null-terminated
array of char, any char-based encoding is usable (UTF-8, ISO-8859-*,
...) but not "wide" char-based encodings (UTF-16, UCS-2, ...).

> Just after the first statement you said "UTF-8" is assumed, so there's
> some assumed encoding.
> 
> Anyway, if the whole thing is put as "Expand and UTF-8 string to UTF-16"
> it sound a bit more clear.
> 
> > doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text
> > is valid UTF-16 text.
> >
> > It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe
> > to use UTF-16.
> >
> > > > + * \param[in] str String to convert
> > >
> > > The string to convert
> > >
> > > > + * \param[in] le Little-endian (false for Big-endian)
> > >
> > > The desired byte-endianess of the converted byte array.
> > >
> > > An enum would not hurt, but it's not strictly required.
> > >
> > > > + *
> > > > + * \return Byte array of UCS-2 representation of \a str, without null-terminator
> > >
> > > While it is still not clear to me the distinction between UTF-16 and
> > > UCS-2 and I get the two are actually converging over time, the
> > > documentaion of std::mbrtoc16 explicitely mentions UTF-16.
> > >
> > > I guess it again depends on the encoding of \a str (which again
> > > depends on the selected locale ?)
> > >
> > > > + */
> > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)
> > >
> > > I wonder why we use snake_case in utils ? maybe to mimic STL ?
> >
> > For some functions, we reuse names defined elsewhere (for instance our
> > custom implementation of secure_getenv() for platforms that don't
> > provide it). For other functions, it's to mimick STL indeed (for
> > instance set_overlap, there's a std::set_union() and
> > std::set_intersection()). Then we have duration_to_timespec() and
> > time_point_to_string() that should probably be in camelCase.
> >
> > For this function, I'd use cameraCase. The function should likely be
> > named utf8ToUtf16 or something similar, to make it clear what the source
> > encoding is.
> >
> > If we want to keep this function a generator UTF-8 to UTF-16 convertor,
> > I'd return a std::u16string instead of a std::vector<uint8_t>. This
> > would mean handling the endianness conversion in the caller. The
> > alternative is to turn this into a UTF-8 to Exif-Unicode function, in
> > which case you can keep the le argument, but I would then rename it to
> > endianness and use the Exif endianness macros as values instead of
> > making it a bool.
> >
> > > > +{
> > > > +	std::mbstate_t state{};
> > > > +	char16_t c16;
> > > > +	const char *ptr = &str[0], *end = &str[0] + str.size();
> > >
> > > One variable per line and maybe
> > >         const char *end = &str.back()
> >
> > back() isn't the same.
> 
> Ups
> std::string::back() : equivalent to operator[](size() - 1)
> 
> > 	const char *ptr = str.data();
> > 	const char *end = ptr + str.size();
> >
> > > > +
> > > > +	std::vector<uint8_t> ret;
> > >
> > > I would reserve str.size() * 2
> > >
> > > Even if I get it's not necessarly that every char in str gets expanded
> > > to two bytes
> > >
> > > > +	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {
> > >
> > > std::mbrtoc16 ?
> > > How come the compiler does not complain ?
> >
> > With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it
> > should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,
> > mbrtoc16() is fine, but the state variable above should be changed.
> 
> Uh, I assumed the C++ header having seen std::mbstate_t
> 
> > The C and C++ libraries are allowed to, but not required to, define
> > functions and types in both the std namespace and the unqualified
> > namespace. It's unsafe to rely on this though.
> >
> > > > +		if (rc == static_cast<size_t>(-2) ||
> > > > +		    rc == static_cast<size_t>(-1))
> > > > +			break;
> > > > +
> > > > +		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
> > > > +		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));
> > >
> > > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets
> > > automatically converted, does it ?
> > >
> > > > +
> > > > +		if (rc > 0)
> > > > +			ptr += rc;
> > > > +	}
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  /**
> > > >   * \fn std::vector<typename T::key_type> map_keys(const T &map)
> > > >   * \brief Retrieve the keys of a std::map<>

Patch
diff mbox series

diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
index f08134af..aa9cc236 100644
--- a/include/libcamera/internal/utils.h
+++ b/include/libcamera/internal/utils.h
@@ -35,6 +35,8 @@  const char *basename(const char *path);
 char *secure_getenv(const char *name);
 std::string dirname(const std::string &path);
 
+std::vector<uint8_t> string_to_c16(const std::string &str, bool le);
+
 template<typename T>
 std::vector<typename T::key_type> map_keys(const T &map)
 {
diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
index e90375ae..89cb0f73 100644
--- a/src/libcamera/utils.cpp
+++ b/src/libcamera/utils.cpp
@@ -17,6 +17,7 @@ 
 #include <string.h>
 #include <sys/stat.h>
 #include <sys/types.h>
+#include <uchar.h>
 #include <unistd.h>
 
 /**
@@ -122,6 +123,35 @@  std::string dirname(const std::string &path)
 	return path.substr(0, pos + 1);
 }
 
+/**
+ * \brief Convert string to byte array of UCS-2
+ * \param[in] str String to convert
+ * \param[in] le Little-endian (false for Big-endian)
+ *
+ * \return Byte array of UCS-2 representation of \a str, without null-terminator
+ */
+std::vector<uint8_t> string_to_c16(const std::string &str, bool le)
+{
+	std::mbstate_t state{};
+	char16_t c16;
+	const char *ptr = &str[0], *end = &str[0] + str.size();
+
+	std::vector<uint8_t> ret;
+	while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {
+		if (rc == static_cast<size_t>(-2) ||
+		    rc == static_cast<size_t>(-1))
+			break;
+
+		ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));
+		ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));
+
+		if (rc > 0)
+			ptr += rc;
+	}
+
+	return ret;
+}
+
 /**
  * \fn std::vector<typename T::key_type> map_keys(const T &map)
  * \brief Retrieve the keys of a std::map<>