Message ID | 20210114104035.302968-2-paul.elder@ideasonboard.com |
---|---|
State | Superseded |
Delegated to: | Paul Elder |
Headers | show |
Series |
|
Related | show |
Hi Paul, I read a few things around, but character encoding seems a very complex subject, so I mostly have minor comments here On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote: > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but From what I've read, even referring to Unicode might be mis-leading as it includes a number of different encodings. Do the EXIF specification mention Unicode or any other more specific standard ? > are recommended to be in UCS-2. Add a function in utils to help with > this. > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com> > --- > include/libcamera/internal/utils.h | 2 ++ > src/libcamera/utils.cpp | 30 ++++++++++++++++++++++++++++++ > 2 files changed, 32 insertions(+) > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h > index f08134af..aa9cc236 100644 > --- a/include/libcamera/internal/utils.h > +++ b/include/libcamera/internal/utils.h > @@ -35,6 +35,8 @@ const char *basename(const char *path); > char *secure_getenv(const char *name); > std::string dirname(const std::string &path); > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le); > + > template<typename T> > std::vector<typename T::key_type> map_keys(const T &map) > { > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp > index e90375ae..89cb0f73 100644 > --- a/src/libcamera/utils.cpp > +++ b/src/libcamera/utils.cpp > @@ -17,6 +17,7 @@ > #include <string.h> > #include <sys/stat.h> > #include <sys/types.h> > +#include <uchar.h> > #include <unistd.h> > > /** > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path) > return path.substr(0, pos + 1); > } > > +/** > + * \brief Convert string to byte array of UCS-2 a string to a byte array of UCS-2 encoded code point But I wonder, the encoding used to represent the characters in the string I assume depends on some locale, do they ? > + * \param[in] str String to convert The string to convert > + * \param[in] le Little-endian (false for Big-endian) The desired byte-endianess of the converted byte array. An enum would not hurt, but it's not strictly required. > + * > + * \return Byte array of UCS-2 representation of \a str, without null-terminator While it is still not clear to me the distinction between UTF-16 and UCS-2 and I get the two are actually converging over time, the documentaion of std::mbrtoc16 explicitely mentions UTF-16. I guess it again depends on the encoding of \a str (which again depends on the selected locale ?) > + */ > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le) I wonder why we use snake_case in utils ? maybe to mimic STL ? > +{ > + std::mbstate_t state{}; > + char16_t c16; > + const char *ptr = &str[0], *end = &str[0] + str.size(); One variable per line and maybe const char *end = &str.back() > + > + std::vector<uint8_t> ret; I would reserve str.size() * 2 Even if I get it's not necessarly that every char in str gets expanded to two bytes > + while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) { std::mbrtoc16 ? How come the compiler does not complain ? > + if (rc == static_cast<size_t>(-2) || > + rc == static_cast<size_t>(-1)) > + break; > + > + ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff)); > + ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff)); I think you can avoid & 0xff as being ret an array of uint8_t c16 gets automatically converted, does it ? > + > + if (rc > 0) > + ptr += rc; > + } > + > + return ret; > +} > + > /** > * \fn std::vector<typename T::key_type> map_keys(const T &map) > * \brief Retrieve the keys of a std::map<> > -- > 2.27.0 > > _______________________________________________ > libcamera-devel mailing list > libcamera-devel@lists.libcamera.org > https://lists.libcamera.org/listinfo/libcamera-devel
Hi Paul, Thank you for the patch. On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote: > Hi Paul, > > I read a few things around, but character encoding seems a very > complex subject, so I mostly have minor comments here > > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote: > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but > > From what I've read, even referring to Unicode might be mis-leading as > it includes a number of different encodings. Do the EXIF specification > mention Unicode or any other more specific standard ? The Exif specification is fairly bad in that regard. It only says "unicode", while what it means (or at least what is used in practice today, I don't know if that's what was meant originally) is UCS-2. The endianness is also not mentioned, and we believe it should match the endianness of the TIFF container, but that's not documented. > > are recommended to be in UCS-2. Add a function in utils to help with > > this. Unicode is a standard that contains (among other things) a set of characters with their respective code points (a number that identifies the character) and a set of character encodings. Encodings define how to encode a code point into bytes. UCS-2 is one particular encoding. Saying "can be in Unicode but are recommended to be in UCS-2" is thus not quite correct. > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com> > > --- > > include/libcamera/internal/utils.h | 2 ++ > > src/libcamera/utils.cpp | 30 ++++++++++++++++++++++++++++++ > > 2 files changed, 32 insertions(+) > > > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h > > index f08134af..aa9cc236 100644 > > --- a/include/libcamera/internal/utils.h > > +++ b/include/libcamera/internal/utils.h > > @@ -35,6 +35,8 @@ const char *basename(const char *path); > > char *secure_getenv(const char *name); > > std::string dirname(const std::string &path); > > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le); > > + I'd move this to the EXIF class in the HAL. There's hardly any chance that libcamera will need to convert a string to UCS-2. > > template<typename T> > > std::vector<typename T::key_type> map_keys(const T &map) > > { > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp > > index e90375ae..89cb0f73 100644 > > --- a/src/libcamera/utils.cpp > > +++ b/src/libcamera/utils.cpp > > @@ -17,6 +17,7 @@ > > #include <string.h> > > #include <sys/stat.h> > > #include <sys/types.h> > > +#include <uchar.h> > > #include <unistd.h> > > > > /** > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path) > > return path.substr(0, pos + 1); > > } > > > > +/** > > + * \brief Convert string to byte array of UCS-2 > > a string to a byte array of UCS-2 encoded code point > > But I wonder, the encoding used to represent the characters in the > string I assume depends on some locale, do they ? std::string has no concept of encoding. It's only an array of char. This function assumes that the input is encoded in UTF-8, and outputs a UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text is valid UTF-16 text. It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe to use UTF-16. > > + * \param[in] str String to convert > > The string to convert > > > + * \param[in] le Little-endian (false for Big-endian) > > The desired byte-endianess of the converted byte array. > > An enum would not hurt, but it's not strictly required. > > > + * > > + * \return Byte array of UCS-2 representation of \a str, without null-terminator > > While it is still not clear to me the distinction between UTF-16 and > UCS-2 and I get the two are actually converging over time, the > documentaion of std::mbrtoc16 explicitely mentions UTF-16. > > I guess it again depends on the encoding of \a str (which again > depends on the selected locale ?) > > > + */ > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le) > > I wonder why we use snake_case in utils ? maybe to mimic STL ? For some functions, we reuse names defined elsewhere (for instance our custom implementation of secure_getenv() for platforms that don't provide it). For other functions, it's to mimick STL indeed (for instance set_overlap, there's a std::set_union() and std::set_intersection()). Then we have duration_to_timespec() and time_point_to_string() that should probably be in camelCase. For this function, I'd use cameraCase. The function should likely be named utf8ToUtf16 or something similar, to make it clear what the source encoding is. If we want to keep this function a generator UTF-8 to UTF-16 convertor, I'd return a std::u16string instead of a std::vector<uint8_t>. This would mean handling the endianness conversion in the caller. The alternative is to turn this into a UTF-8 to Exif-Unicode function, in which case you can keep the le argument, but I would then rename it to endianness and use the Exif endianness macros as values instead of making it a bool. > > +{ > > + std::mbstate_t state{}; > > + char16_t c16; > > + const char *ptr = &str[0], *end = &str[0] + str.size(); > > One variable per line and maybe > const char *end = &str.back() back() isn't the same. const char *ptr = str.data(); const char *end = ptr + str.size(); > > + > > + std::vector<uint8_t> ret; > > I would reserve str.size() * 2 > > Even if I get it's not necessarly that every char in str gets expanded > to two bytes > > > + while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) { > > std::mbrtoc16 ? > How come the compiler does not complain ? With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it should be std::mbstate_t and std::mbrtoc16(). As we use the C headers, mbrtoc16() is fine, but the state variable above should be changed. The C and C++ libraries are allowed to, but not required to, define functions and types in both the std namespace and the unqualified namespace. It's unsafe to rely on this though. > > + if (rc == static_cast<size_t>(-2) || > > + rc == static_cast<size_t>(-1)) > > + break; > > + > > + ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff)); > > + ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff)); > > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets > automatically converted, does it ? > > > + > > + if (rc > 0) > > + ptr += rc; > > + } > > + > > + return ret; > > +} > > + > > /** > > * \fn std::vector<typename T::key_type> map_keys(const T &map) > > * \brief Retrieve the keys of a std::map<>
Hi Laurent, On Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote: > Hi Paul, > > Thank you for the patch. > > On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote: > > Hi Paul, > > > > I read a few things around, but character encoding seems a very > > complex subject, so I mostly have minor comments here > > > > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote: > > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but > > > > From what I've read, even referring to Unicode might be mis-leading as > > it includes a number of different encodings. Do the EXIF specification > > mention Unicode or any other more specific standard ? > > The Exif specification is fairly bad in that regard. It only says > "unicode", while what it means (or at least what is used in practice > today, I don't know if that's what was meant originally) is UCS-2. The > endianness is also not mentioned, and we believe it should match the > endianness of the TIFF container, but that's not documented. > > > > > are recommended to be in UCS-2. Add a function in utils to help with > > > this. > > Unicode is a standard that contains (among other things) a set of > characters with their respective code points (a number that identifies > the character) and a set of character encodings. Encodings define how > to encode a code point into bytes. UCS-2 is one particular encoding. > Saying "can be in Unicode but are recommended to be in UCS-2" is thus > not quite correct. > > > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com> > > > --- > > > include/libcamera/internal/utils.h | 2 ++ > > > src/libcamera/utils.cpp | 30 ++++++++++++++++++++++++++++++ > > > 2 files changed, 32 insertions(+) > > > > > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h > > > index f08134af..aa9cc236 100644 > > > --- a/include/libcamera/internal/utils.h > > > +++ b/include/libcamera/internal/utils.h > > > @@ -35,6 +35,8 @@ const char *basename(const char *path); > > > char *secure_getenv(const char *name); > > > std::string dirname(const std::string &path); > > > > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le); > > > + > > I'd move this to the EXIF class in the HAL. There's hardly any chance > that libcamera will need to convert a string to UCS-2. > > > > template<typename T> > > > std::vector<typename T::key_type> map_keys(const T &map) > > > { > > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp > > > index e90375ae..89cb0f73 100644 > > > --- a/src/libcamera/utils.cpp > > > +++ b/src/libcamera/utils.cpp > > > @@ -17,6 +17,7 @@ > > > #include <string.h> > > > #include <sys/stat.h> > > > #include <sys/types.h> > > > +#include <uchar.h> > > > #include <unistd.h> > > > > > > /** > > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path) > > > return path.substr(0, pos + 1); > > > } > > > > > > +/** > > > + * \brief Convert string to byte array of UCS-2 > > > > a string to a byte array of UCS-2 encoded code point > > > > But I wonder, the encoding used to represent the characters in the > > string I assume depends on some locale, do they ? > > std::string has no concept of encoding. It's only an array of char. This > function assumes that the input is encoded in UTF-8, and outputs a > UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string Not sure I get the "concept of encoding". Each char is an 8-bit value which represents a code point. Different encodings might associate the code point to a different symbol. Just after the first statement you said "UTF-8" is assumed, so there's some assumed encoding. Anyway, if the whole thing is put as "Expand and UTF-8 string to UTF-16" it sound a bit more clear. > doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text > is valid UTF-16 text. > > It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe > to use UTF-16. > > > > + * \param[in] str String to convert > > > > The string to convert > > > > > + * \param[in] le Little-endian (false for Big-endian) > > > > The desired byte-endianess of the converted byte array. > > > > An enum would not hurt, but it's not strictly required. > > > > > + * > > > + * \return Byte array of UCS-2 representation of \a str, without null-terminator > > > > While it is still not clear to me the distinction between UTF-16 and > > UCS-2 and I get the two are actually converging over time, the > > documentaion of std::mbrtoc16 explicitely mentions UTF-16. > > > > I guess it again depends on the encoding of \a str (which again > > depends on the selected locale ?) > > > > > + */ > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le) > > > > I wonder why we use snake_case in utils ? maybe to mimic STL ? > > For some functions, we reuse names defined elsewhere (for instance our > custom implementation of secure_getenv() for platforms that don't > provide it). For other functions, it's to mimick STL indeed (for > instance set_overlap, there's a std::set_union() and > std::set_intersection()). Then we have duration_to_timespec() and > time_point_to_string() that should probably be in camelCase. > > For this function, I'd use cameraCase. The function should likely be > named utf8ToUtf16 or something similar, to make it clear what the source > encoding is. > > If we want to keep this function a generator UTF-8 to UTF-16 convertor, > I'd return a std::u16string instead of a std::vector<uint8_t>. This > would mean handling the endianness conversion in the caller. The > alternative is to turn this into a UTF-8 to Exif-Unicode function, in > which case you can keep the le argument, but I would then rename it to > endianness and use the Exif endianness macros as values instead of > making it a bool. > > > > +{ > > > + std::mbstate_t state{}; > > > + char16_t c16; > > > + const char *ptr = &str[0], *end = &str[0] + str.size(); > > > > One variable per line and maybe > > const char *end = &str.back() > > back() isn't the same. > Ups std::string::back() : equivalent to operator[](size() - 1) > const char *ptr = str.data(); > const char *end = ptr + str.size(); > > > > + > > > + std::vector<uint8_t> ret; > > > > I would reserve str.size() * 2 > > > > Even if I get it's not necessarly that every char in str gets expanded > > to two bytes > > > > > + while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) { > > > > std::mbrtoc16 ? > > How come the compiler does not complain ? > > With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it > should be std::mbstate_t and std::mbrtoc16(). As we use the C headers, > mbrtoc16() is fine, but the state variable above should be changed. Uh, I assumed the C++ header having seen std::mbstate_t > > The C and C++ libraries are allowed to, but not required to, define > functions and types in both the std namespace and the unqualified > namespace. It's unsafe to rely on this though. > > > > + if (rc == static_cast<size_t>(-2) || > > > + rc == static_cast<size_t>(-1)) > > > + break; > > > + > > > + ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff)); > > > + ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff)); > > > > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets > > automatically converted, does it ? > > > > > + > > > + if (rc > 0) > > > + ptr += rc; > > > + } > > > + > > > + return ret; > > > +} > > > + > > > /** > > > * \fn std::vector<typename T::key_type> map_keys(const T &map) > > > * \brief Retrieve the keys of a std::map<> > > -- > Regards, > > Laurent Pinchart
Hi Jacopo, On Mon, Jan 18, 2021 at 10:17:11AM +0100, Jacopo Mondi wrote: > On Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote: > > On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote: > > > Hi Paul, > > > > > > I read a few things around, but character encoding seems a very > > > complex subject, so I mostly have minor comments here > > > > > > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote: > > > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but > > > > > > From what I've read, even referring to Unicode might be mis-leading as > > > it includes a number of different encodings. Do the EXIF specification > > > mention Unicode or any other more specific standard ? > > > > The Exif specification is fairly bad in that regard. It only says > > "unicode", while what it means (or at least what is used in practice > > today, I don't know if that's what was meant originally) is UCS-2. The > > endianness is also not mentioned, and we believe it should match the > > endianness of the TIFF container, but that's not documented. > > > > > > are recommended to be in UCS-2. Add a function in utils to help with > > > > this. > > > > Unicode is a standard that contains (among other things) a set of > > characters with their respective code points (a number that identifies > > the character) and a set of character encodings. Encodings define how > > to encode a code point into bytes. UCS-2 is one particular encoding. > > Saying "can be in Unicode but are recommended to be in UCS-2" is thus > > not quite correct. > > > > > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com> > > > > --- > > > > include/libcamera/internal/utils.h | 2 ++ > > > > src/libcamera/utils.cpp | 30 ++++++++++++++++++++++++++++++ > > > > 2 files changed, 32 insertions(+) > > > > > > > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h > > > > index f08134af..aa9cc236 100644 > > > > --- a/include/libcamera/internal/utils.h > > > > +++ b/include/libcamera/internal/utils.h > > > > @@ -35,6 +35,8 @@ const char *basename(const char *path); > > > > char *secure_getenv(const char *name); > > > > std::string dirname(const std::string &path); > > > > > > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le); > > > > + > > > > I'd move this to the EXIF class in the HAL. There's hardly any chance > > that libcamera will need to convert a string to UCS-2. > > > > > > template<typename T> > > > > std::vector<typename T::key_type> map_keys(const T &map) > > > > { > > > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp > > > > index e90375ae..89cb0f73 100644 > > > > --- a/src/libcamera/utils.cpp > > > > +++ b/src/libcamera/utils.cpp > > > > @@ -17,6 +17,7 @@ > > > > #include <string.h> > > > > #include <sys/stat.h> > > > > #include <sys/types.h> > > > > +#include <uchar.h> > > > > #include <unistd.h> > > > > > > > > /** > > > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path) > > > > return path.substr(0, pos + 1); > > > > } > > > > > > > > +/** > > > > + * \brief Convert string to byte array of UCS-2 > > > > > > a string to a byte array of UCS-2 encoded code point > > > > > > But I wonder, the encoding used to represent the characters in the > > > string I assume depends on some locale, do they ? > > > > std::string has no concept of encoding. It's only an array of char. This > > function assumes that the input is encoded in UTF-8, and outputs a > > UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string > > Not sure I get the "concept of encoding". Each char is an 8-bit value > which represents a code point. Different encodings might associate the code > point to a different symbol. In unicode a code point is a 32-bit value. The encoding then defines how the code point is encoded in bytes (in a variable number of bytes for UTF-8, a single byte that can only represent a subset of unicode for ISO-8859-*, a variable number of u16 for UTF-16, a single u16 that can only represent a subdev for unicode for UCS-2, ...). An std::string doesn't specify any encoding. As it stores a string as a null-terminated array of char, any char-based encoding is usable (UTF-8, ISO-8859-*, ...) but not "wide" char-based encodings (UTF-16, UCS-2, ...). > Just after the first statement you said "UTF-8" is assumed, so there's > some assumed encoding. > > Anyway, if the whole thing is put as "Expand and UTF-8 string to UTF-16" > it sound a bit more clear. > > > doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text > > is valid UTF-16 text. > > > > It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe > > to use UTF-16. > > > > > > + * \param[in] str String to convert > > > > > > The string to convert > > > > > > > + * \param[in] le Little-endian (false for Big-endian) > > > > > > The desired byte-endianess of the converted byte array. > > > > > > An enum would not hurt, but it's not strictly required. > > > > > > > + * > > > > + * \return Byte array of UCS-2 representation of \a str, without null-terminator > > > > > > While it is still not clear to me the distinction between UTF-16 and > > > UCS-2 and I get the two are actually converging over time, the > > > documentaion of std::mbrtoc16 explicitely mentions UTF-16. > > > > > > I guess it again depends on the encoding of \a str (which again > > > depends on the selected locale ?) > > > > > > > + */ > > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le) > > > > > > I wonder why we use snake_case in utils ? maybe to mimic STL ? > > > > For some functions, we reuse names defined elsewhere (for instance our > > custom implementation of secure_getenv() for platforms that don't > > provide it). For other functions, it's to mimick STL indeed (for > > instance set_overlap, there's a std::set_union() and > > std::set_intersection()). Then we have duration_to_timespec() and > > time_point_to_string() that should probably be in camelCase. > > > > For this function, I'd use cameraCase. The function should likely be > > named utf8ToUtf16 or something similar, to make it clear what the source > > encoding is. > > > > If we want to keep this function a generator UTF-8 to UTF-16 convertor, > > I'd return a std::u16string instead of a std::vector<uint8_t>. This > > would mean handling the endianness conversion in the caller. The > > alternative is to turn this into a UTF-8 to Exif-Unicode function, in > > which case you can keep the le argument, but I would then rename it to > > endianness and use the Exif endianness macros as values instead of > > making it a bool. > > > > > > +{ > > > > + std::mbstate_t state{}; > > > > + char16_t c16; > > > > + const char *ptr = &str[0], *end = &str[0] + str.size(); > > > > > > One variable per line and maybe > > > const char *end = &str.back() > > > > back() isn't the same. > > Ups > std::string::back() : equivalent to operator[](size() - 1) > > > const char *ptr = str.data(); > > const char *end = ptr + str.size(); > > > > > > + > > > > + std::vector<uint8_t> ret; > > > > > > I would reserve str.size() * 2 > > > > > > Even if I get it's not necessarly that every char in str gets expanded > > > to two bytes > > > > > > > + while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) { > > > > > > std::mbrtoc16 ? > > > How come the compiler does not complain ? > > > > With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it > > should be std::mbstate_t and std::mbrtoc16(). As we use the C headers, > > mbrtoc16() is fine, but the state variable above should be changed. > > Uh, I assumed the C++ header having seen std::mbstate_t > > > The C and C++ libraries are allowed to, but not required to, define > > functions and types in both the std namespace and the unqualified > > namespace. It's unsafe to rely on this though. > > > > > > + if (rc == static_cast<size_t>(-2) || > > > > + rc == static_cast<size_t>(-1)) > > > > + break; > > > > + > > > > + ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff)); > > > > + ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff)); > > > > > > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets > > > automatically converted, does it ? > > > > > > > + > > > > + if (rc > 0) > > > > + ptr += rc; > > > > + } > > > > + > > > > + return ret; > > > > +} > > > > + > > > > /** > > > > * \fn std::vector<typename T::key_type> map_keys(const T &map) > > > > * \brief Retrieve the keys of a std::map<>
diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h index f08134af..aa9cc236 100644 --- a/include/libcamera/internal/utils.h +++ b/include/libcamera/internal/utils.h @@ -35,6 +35,8 @@ const char *basename(const char *path); char *secure_getenv(const char *name); std::string dirname(const std::string &path); +std::vector<uint8_t> string_to_c16(const std::string &str, bool le); + template<typename T> std::vector<typename T::key_type> map_keys(const T &map) { diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp index e90375ae..89cb0f73 100644 --- a/src/libcamera/utils.cpp +++ b/src/libcamera/utils.cpp @@ -17,6 +17,7 @@ #include <string.h> #include <sys/stat.h> #include <sys/types.h> +#include <uchar.h> #include <unistd.h> /** @@ -122,6 +123,35 @@ std::string dirname(const std::string &path) return path.substr(0, pos + 1); } +/** + * \brief Convert string to byte array of UCS-2 + * \param[in] str String to convert + * \param[in] le Little-endian (false for Big-endian) + * + * \return Byte array of UCS-2 representation of \a str, without null-terminator + */ +std::vector<uint8_t> string_to_c16(const std::string &str, bool le) +{ + std::mbstate_t state{}; + char16_t c16; + const char *ptr = &str[0], *end = &str[0] + str.size(); + + std::vector<uint8_t> ret; + while (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) { + if (rc == static_cast<size_t>(-2) || + rc == static_cast<size_t>(-1)) + break; + + ret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff)); + ret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff)); + + if (rc > 0) + ptr += rc; + } + + return ret; +} + /** * \fn std::vector<typename T::key_type> map_keys(const T &map) * \brief Retrieve the keys of a std::map<>
GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but are recommended to be in UCS-2. Add a function in utils to help with this. Signed-off-by: Paul Elder <paul.elder@ideasonboard.com> --- include/libcamera/internal/utils.h | 2 ++ src/libcamera/utils.cpp | 30 ++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+)