[{"id":14548,"web_url":"https://patchwork.libcamera.org/comment/14548/","msgid":"<20210115141953.cvla5ym32ypqhw2l@uno.localdomain>","date":"2021-01-15T14:19:53","subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","submitter":{"id":3,"url":"https://patchwork.libcamera.org/api/people/3/","name":"Jacopo Mondi","email":"jacopo@jmondi.org"},"content":"Hi Paul,\n\n  I read a few things around, but character encoding seems a very\n  complex subject, so I mostly have minor comments here\n\nOn Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:\n> GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but\n\nFrom what I've read, even referring to Unicode might be mis-leading as\nit includes a number of different encodings. Do the EXIF specification\nmention Unicode or any other more specific standard ?\n\n> are recommended to be in UCS-2. Add a function in utils to help with\n> this.\n>\n> Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>\n> ---\n>  include/libcamera/internal/utils.h |  2 ++\n>  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++\n>  2 files changed, 32 insertions(+)\n>\n> diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h\n> index f08134af..aa9cc236 100644\n> --- a/include/libcamera/internal/utils.h\n> +++ b/include/libcamera/internal/utils.h\n> @@ -35,6 +35,8 @@ const char *basename(const char *path);\n>  char *secure_getenv(const char *name);\n>  std::string dirname(const std::string &path);\n>\n> +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);\n> +\n>  template<typename T>\n>  std::vector<typename T::key_type> map_keys(const T &map)\n>  {\n> diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp\n> index e90375ae..89cb0f73 100644\n> --- a/src/libcamera/utils.cpp\n> +++ b/src/libcamera/utils.cpp\n> @@ -17,6 +17,7 @@\n>  #include <string.h>\n>  #include <sys/stat.h>\n>  #include <sys/types.h>\n> +#include <uchar.h>\n>  #include <unistd.h>\n>\n>  /**\n> @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)\n>  \treturn path.substr(0, pos + 1);\n>  }\n>\n> +/**\n> + * \\brief Convert string to byte array of UCS-2\n\na string to a byte array of UCS-2 encoded code point\n\nBut I wonder, the encoding used to represent the characters in the\nstring I assume depends on some locale, do they ?\n\n> + * \\param[in] str String to convert\n\nThe string to convert\n\n> + * \\param[in] le Little-endian (false for Big-endian)\n\nThe desired byte-endianess of the converted byte array.\n\nAn enum would not hurt, but it's not strictly required.\n\n> + *\n> + * \\return Byte array of UCS-2 representation of \\a str, without null-terminator\n\nWhile it is still not clear to me the distinction between UTF-16 and\nUCS-2 and I get the two are actually converging over time, the\ndocumentaion of std::mbrtoc16 explicitely mentions UTF-16.\n\nI guess it again depends on the encoding of \\a str (which again\ndepends on the selected locale ?)\n\n> + */\n> +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)\n\nI wonder why we use snake_case in utils ? maybe to mimic STL ?\n\n> +{\n> +\tstd::mbstate_t state{};\n> +\tchar16_t c16;\n> +\tconst char *ptr = &str[0], *end = &str[0] + str.size();\n\nOne variable per line and maybe\n        const char *end = &str.back()\n> +\n> +\tstd::vector<uint8_t> ret;\n\nI would reserve str.size() * 2\n\nEven if I get it's not necessarly that every char in str gets expanded\nto two bytes\n\n> +\twhile (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {\n\nstd::mbrtoc16 ?\nHow come the compiler does not complain ?\n\n> +\t\tif (rc == static_cast<size_t>(-2) ||\n> +\t\t    rc == static_cast<size_t>(-1))\n> +\t\t\tbreak;\n> +\n> +\t\tret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));\n> +\t\tret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));\n\nI think you can avoid & 0xff as being ret an array of uint8_t c16 gets\nautomatically converted, does it ?\n\n> +\n> +\t\tif (rc > 0)\n> +\t\t\tptr += rc;\n> +\t}\n> +\n> +\treturn ret;\n> +}\n> +\n>  /**\n>   * \\fn std::vector<typename T::key_type> map_keys(const T &map)\n>   * \\brief Retrieve the keys of a std::map<>\n> --\n> 2.27.0\n>\n> _______________________________________________\n> libcamera-devel mailing list\n> libcamera-devel@lists.libcamera.org\n> https://lists.libcamera.org/listinfo/libcamera-devel","headers":{"Return-Path":"<libcamera-devel-bounces@lists.libcamera.org>","X-Original-To":"parsemail@patchwork.libcamera.org","Delivered-To":"parsemail@patchwork.libcamera.org","Received":["from lancelot.ideasonboard.com (lancelot.ideasonboard.com\n\t[92.243.16.209])\n\tby patchwork.libcamera.org (Postfix) with ESMTPS id 91165BD808\n\tfor <parsemail@patchwork.libcamera.org>;\n\tFri, 15 Jan 2021 14:19:36 +0000 (UTC)","from lancelot.ideasonboard.com (localhost [IPv6:::1])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTP id 15B13680E7;\n\tFri, 15 Jan 2021 15:19:36 +0100 (CET)","from relay6-d.mail.gandi.net (relay6-d.mail.gandi.net\n\t[217.70.183.198])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTPS id 912D6680D1\n\tfor <libcamera-devel@lists.libcamera.org>;\n\tFri, 15 Jan 2021 15:19:35 +0100 (CET)","from uno.localdomain (93-34-118-233.ip49.fastwebnet.it\n\t[93.34.118.233]) (Authenticated sender: jacopo@jmondi.org)\n\tby relay6-d.mail.gandi.net (Postfix) with ESMTPSA id 0D2FFC000B;\n\tFri, 15 Jan 2021 14:19:34 +0000 (UTC)"],"X-Originating-IP":"93.34.118.233","Date":"Fri, 15 Jan 2021 15:19:53 +0100","From":"Jacopo Mondi <jacopo@jmondi.org>","To":"Paul Elder <paul.elder@ideasonboard.com>","Message-ID":"<20210115141953.cvla5ym32ypqhw2l@uno.localdomain>","References":"<20210114104035.302968-1-paul.elder@ideasonboard.com>\n\t<20210114104035.302968-2-paul.elder@ideasonboard.com>","MIME-Version":"1.0","Content-Disposition":"inline","In-Reply-To":"<20210114104035.302968-2-paul.elder@ideasonboard.com>","Subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","X-BeenThere":"libcamera-devel@lists.libcamera.org","X-Mailman-Version":"2.1.29","Precedence":"list","List-Id":"<libcamera-devel.lists.libcamera.org>","List-Unsubscribe":"<https://lists.libcamera.org/options/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=unsubscribe>","List-Archive":"<https://lists.libcamera.org/pipermail/libcamera-devel/>","List-Post":"<mailto:libcamera-devel@lists.libcamera.org>","List-Help":"<mailto:libcamera-devel-request@lists.libcamera.org?subject=help>","List-Subscribe":"<https://lists.libcamera.org/listinfo/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=subscribe>","Cc":"libcamera-devel@lists.libcamera.org","Content-Type":"text/plain; charset=\"us-ascii\"","Content-Transfer-Encoding":"7bit","Errors-To":"libcamera-devel-bounces@lists.libcamera.org","Sender":"\"libcamera-devel\" <libcamera-devel-bounces@lists.libcamera.org>"}},{"id":14553,"web_url":"https://patchwork.libcamera.org/comment/14553/","msgid":"<YAHKbwPOfvDWmpH/@pendragon.ideasonboard.com>","date":"2021-01-15T17:01:35","subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","submitter":{"id":2,"url":"https://patchwork.libcamera.org/api/people/2/","name":"Laurent Pinchart","email":"laurent.pinchart@ideasonboard.com"},"content":"Hi Paul,\n\nThank you for the patch.\n\nOn Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:\n> Hi Paul,\n> \n>   I read a few things around, but character encoding seems a very\n>   complex subject, so I mostly have minor comments here\n> \n> On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:\n> > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but\n> \n> From what I've read, even referring to Unicode might be mis-leading as\n> it includes a number of different encodings. Do the EXIF specification\n> mention Unicode or any other more specific standard ?\n\nThe Exif specification is fairly bad in that regard. It only says\n\"unicode\", while what it means (or at least what is used in practice\ntoday, I don't know if that's what was meant originally) is UCS-2. The\nendianness is also not mentioned, and we believe it should match the\nendianness of the TIFF container, but that's not documented.\n\n\n> > are recommended to be in UCS-2. Add a function in utils to help with\n> > this.\n\nUnicode is a standard that contains (among other things) a set of\ncharacters with their respective code points (a number that identifies\nthe character) and a set of character encodings. Encodings define how\nto encode a code point into bytes. UCS-2 is one particular encoding.\nSaying \"can be in Unicode but are recommended to be in UCS-2\" is thus\nnot quite correct.\n\n> > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>\n> > ---\n> >  include/libcamera/internal/utils.h |  2 ++\n> >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++\n> >  2 files changed, 32 insertions(+)\n> >\n> > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h\n> > index f08134af..aa9cc236 100644\n> > --- a/include/libcamera/internal/utils.h\n> > +++ b/include/libcamera/internal/utils.h\n> > @@ -35,6 +35,8 @@ const char *basename(const char *path);\n> >  char *secure_getenv(const char *name);\n> >  std::string dirname(const std::string &path);\n> >\n> > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);\n> > +\n\nI'd move this to the EXIF class in the HAL. There's hardly any chance\nthat libcamera will need to convert a string to UCS-2.\n\n> >  template<typename T>\n> >  std::vector<typename T::key_type> map_keys(const T &map)\n> >  {\n> > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp\n> > index e90375ae..89cb0f73 100644\n> > --- a/src/libcamera/utils.cpp\n> > +++ b/src/libcamera/utils.cpp\n> > @@ -17,6 +17,7 @@\n> >  #include <string.h>\n> >  #include <sys/stat.h>\n> >  #include <sys/types.h>\n> > +#include <uchar.h>\n> >  #include <unistd.h>\n> >\n> >  /**\n> > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)\n> >  \treturn path.substr(0, pos + 1);\n> >  }\n> >\n> > +/**\n> > + * \\brief Convert string to byte array of UCS-2\n> \n> a string to a byte array of UCS-2 encoded code point\n> \n> But I wonder, the encoding used to represent the characters in the\n> string I assume depends on some locale, do they ?\n\nstd::string has no concept of encoding. It's only an array of char. This\nfunction assumes that the input is encoded in UTF-8, and outputs a\nUTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string\ndoesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text\nis valid UTF-16 text.\n\nIt's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe\nto use UTF-16.\n\n> > + * \\param[in] str String to convert\n> \n> The string to convert\n> \n> > + * \\param[in] le Little-endian (false for Big-endian)\n> \n> The desired byte-endianess of the converted byte array.\n> \n> An enum would not hurt, but it's not strictly required.\n> \n> > + *\n> > + * \\return Byte array of UCS-2 representation of \\a str, without null-terminator\n> \n> While it is still not clear to me the distinction between UTF-16 and\n> UCS-2 and I get the two are actually converging over time, the\n> documentaion of std::mbrtoc16 explicitely mentions UTF-16.\n> \n> I guess it again depends on the encoding of \\a str (which again\n> depends on the selected locale ?)\n> \n> > + */\n> > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)\n> \n> I wonder why we use snake_case in utils ? maybe to mimic STL ?\n\nFor some functions, we reuse names defined elsewhere (for instance our\ncustom implementation of secure_getenv() for platforms that don't\nprovide it). For other functions, it's to mimick STL indeed (for\ninstance set_overlap, there's a std::set_union() and\nstd::set_intersection()). Then we have duration_to_timespec() and\ntime_point_to_string() that should probably be in camelCase.\n\nFor this function, I'd use cameraCase. The function should likely be\nnamed utf8ToUtf16 or something similar, to make it clear what the source\nencoding is.\n\nIf we want to keep this function a generator UTF-8 to UTF-16 convertor,\nI'd return a std::u16string instead of a std::vector<uint8_t>. This\nwould mean handling the endianness conversion in the caller. The\nalternative is to turn this into a UTF-8 to Exif-Unicode function, in\nwhich case you can keep the le argument, but I would then rename it to\nendianness and use the Exif endianness macros as values instead of\nmaking it a bool.\n\n> > +{\n> > +\tstd::mbstate_t state{};\n> > +\tchar16_t c16;\n> > +\tconst char *ptr = &str[0], *end = &str[0] + str.size();\n> \n> One variable per line and maybe\n>         const char *end = &str.back()\n\nback() isn't the same.\n\n\tconst char *ptr = str.data();\n\tconst char *end = ptr + str.size();\n\n> > +\n> > +\tstd::vector<uint8_t> ret;\n> \n> I would reserve str.size() * 2\n> \n> Even if I get it's not necessarly that every char in str gets expanded\n> to two bytes\n> \n> > +\twhile (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {\n> \n> std::mbrtoc16 ?\n> How come the compiler does not complain ?\n\nWith uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it\nshould be std::mbstate_t and std::mbrtoc16(). As we use the C headers,\nmbrtoc16() is fine, but the state variable above should be changed.\n\nThe C and C++ libraries are allowed to, but not required to, define\nfunctions and types in both the std namespace and the unqualified\nnamespace. It's unsafe to rely on this though.\n\n> > +\t\tif (rc == static_cast<size_t>(-2) ||\n> > +\t\t    rc == static_cast<size_t>(-1))\n> > +\t\t\tbreak;\n> > +\n> > +\t\tret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));\n> > +\t\tret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));\n> \n> I think you can avoid & 0xff as being ret an array of uint8_t c16 gets\n> automatically converted, does it ?\n> \n> > +\n> > +\t\tif (rc > 0)\n> > +\t\t\tptr += rc;\n> > +\t}\n> > +\n> > +\treturn ret;\n> > +}\n> > +\n> >  /**\n> >   * \\fn std::vector<typename T::key_type> map_keys(const T &map)\n> >   * \\brief Retrieve the keys of a std::map<>","headers":{"Return-Path":"<libcamera-devel-bounces@lists.libcamera.org>","X-Original-To":"parsemail@patchwork.libcamera.org","Delivered-To":"parsemail@patchwork.libcamera.org","Received":["from lancelot.ideasonboard.com (lancelot.ideasonboard.com\n\t[92.243.16.209])\n\tby patchwork.libcamera.org (Postfix) with ESMTPS id 337C2C3383\n\tfor <parsemail@patchwork.libcamera.org>;\n\tFri, 15 Jan 2021 17:01:54 +0000 (UTC)","from lancelot.ideasonboard.com (localhost [IPv6:::1])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTP id 02C7E680F5;\n\tFri, 15 Jan 2021 18:01:54 +0100 (CET)","from perceval.ideasonboard.com (perceval.ideasonboard.com\n\t[213.167.242.64])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTPS id 823C8680EF\n\tfor <libcamera-devel@lists.libcamera.org>;\n\tFri, 15 Jan 2021 18:01:53 +0100 (CET)","from pendragon.ideasonboard.com (62-78-145-57.bb.dnainternet.fi\n\t[62.78.145.57])\n\tby perceval.ideasonboard.com (Postfix) with ESMTPSA id 1746F527;\n\tFri, 15 Jan 2021 18:01:53 +0100 (CET)"],"Authentication-Results":"lancelot.ideasonboard.com;\n\tdkim=fail reason=\"signature verification failed\" (1024-bit key;\n\tunprotected) header.d=ideasonboard.com header.i=@ideasonboard.com\n\theader.b=\"bhaxzr4G\"; dkim-atps=neutral","DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/simple; d=ideasonboard.com;\n\ts=mail; t=1610730113;\n\tbh=koPS4v9f1HKZT56dczkz24svFYfHxoSTkHNgRly69BE=;\n\th=Date:From:To:Cc:Subject:References:In-Reply-To:From;\n\tb=bhaxzr4GWvT6RLwZYs/mFtwZ1B1iV5JyQYyd3Zkam1nbEOvjyZcXMHFbhjGtV0v1s\n\t6cYS0gjQbKlWgK3WTfdU+4bxRE0SeFygJzNE/p+GcmItsa6ngbOi80/qMLRxbBUTFf\n\tCGE258BhtWlOgxpwc1vlR0Z+vO0SiwyQ8zefs8QI=","Date":"Fri, 15 Jan 2021 19:01:35 +0200","From":"Laurent Pinchart <laurent.pinchart@ideasonboard.com>","To":"Paul Elder <paul.elder@ideasonboard.com>","Message-ID":"<YAHKbwPOfvDWmpH/@pendragon.ideasonboard.com>","References":"<20210114104035.302968-1-paul.elder@ideasonboard.com>\n\t<20210114104035.302968-2-paul.elder@ideasonboard.com>\n\t<20210115141953.cvla5ym32ypqhw2l@uno.localdomain>","MIME-Version":"1.0","Content-Disposition":"inline","In-Reply-To":"<20210115141953.cvla5ym32ypqhw2l@uno.localdomain>","Subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","X-BeenThere":"libcamera-devel@lists.libcamera.org","X-Mailman-Version":"2.1.29","Precedence":"list","List-Id":"<libcamera-devel.lists.libcamera.org>","List-Unsubscribe":"<https://lists.libcamera.org/options/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=unsubscribe>","List-Archive":"<https://lists.libcamera.org/pipermail/libcamera-devel/>","List-Post":"<mailto:libcamera-devel@lists.libcamera.org>","List-Help":"<mailto:libcamera-devel-request@lists.libcamera.org?subject=help>","List-Subscribe":"<https://lists.libcamera.org/listinfo/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=subscribe>","Cc":"libcamera-devel@lists.libcamera.org","Content-Type":"text/plain; charset=\"utf-8\"","Content-Transfer-Encoding":"base64","Errors-To":"libcamera-devel-bounces@lists.libcamera.org","Sender":"\"libcamera-devel\" <libcamera-devel-bounces@lists.libcamera.org>"}},{"id":14564,"web_url":"https://patchwork.libcamera.org/comment/14564/","msgid":"<20210118091711.g5l2a3b2au7t2vn7@uno.localdomain>","date":"2021-01-18T09:17:11","subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","submitter":{"id":3,"url":"https://patchwork.libcamera.org/api/people/3/","name":"Jacopo Mondi","email":"jacopo@jmondi.org"},"content":"Hi Laurent,\n\nOn Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote:\n> Hi Paul,\n>\n> Thank you for the patch.\n>\n> On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:\n> > Hi Paul,\n> >\n> >   I read a few things around, but character encoding seems a very\n> >   complex subject, so I mostly have minor comments here\n> >\n> > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:\n> > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but\n> >\n> > From what I've read, even referring to Unicode might be mis-leading as\n> > it includes a number of different encodings. Do the EXIF specification\n> > mention Unicode or any other more specific standard ?\n>\n> The Exif specification is fairly bad in that regard. It only says\n> \"unicode\", while what it means (or at least what is used in practice\n> today, I don't know if that's what was meant originally) is UCS-2. The\n> endianness is also not mentioned, and we believe it should match the\n> endianness of the TIFF container, but that's not documented.\n>\n>\n> > > are recommended to be in UCS-2. Add a function in utils to help with\n> > > this.\n>\n> Unicode is a standard that contains (among other things) a set of\n> characters with their respective code points (a number that identifies\n> the character) and a set of character encodings. Encodings define how\n> to encode a code point into bytes. UCS-2 is one particular encoding.\n> Saying \"can be in Unicode but are recommended to be in UCS-2\" is thus\n> not quite correct.\n>\n> > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>\n> > > ---\n> > >  include/libcamera/internal/utils.h |  2 ++\n> > >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++\n> > >  2 files changed, 32 insertions(+)\n> > >\n> > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h\n> > > index f08134af..aa9cc236 100644\n> > > --- a/include/libcamera/internal/utils.h\n> > > +++ b/include/libcamera/internal/utils.h\n> > > @@ -35,6 +35,8 @@ const char *basename(const char *path);\n> > >  char *secure_getenv(const char *name);\n> > >  std::string dirname(const std::string &path);\n> > >\n> > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);\n> > > +\n>\n> I'd move this to the EXIF class in the HAL. There's hardly any chance\n> that libcamera will need to convert a string to UCS-2.\n>\n> > >  template<typename T>\n> > >  std::vector<typename T::key_type> map_keys(const T &map)\n> > >  {\n> > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp\n> > > index e90375ae..89cb0f73 100644\n> > > --- a/src/libcamera/utils.cpp\n> > > +++ b/src/libcamera/utils.cpp\n> > > @@ -17,6 +17,7 @@\n> > >  #include <string.h>\n> > >  #include <sys/stat.h>\n> > >  #include <sys/types.h>\n> > > +#include <uchar.h>\n> > >  #include <unistd.h>\n> > >\n> > >  /**\n> > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)\n> > >  \treturn path.substr(0, pos + 1);\n> > >  }\n> > >\n> > > +/**\n> > > + * \\brief Convert string to byte array of UCS-2\n> >\n> > a string to a byte array of UCS-2 encoded code point\n> >\n> > But I wonder, the encoding used to represent the characters in the\n> > string I assume depends on some locale, do they ?\n>\n> std::string has no concept of encoding. It's only an array of char. This\n> function assumes that the input is encoded in UTF-8, and outputs a\n> UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string\n\nNot sure I get the \"concept of encoding\". Each char is an 8-bit value\nwhich represents a code point. Different encodings might associate the code\npoint to a different symbol.\n\nJust after the first statement you said \"UTF-8\" is assumed, so there's\nsome assumed encoding.\n\nAnyway, if the whole thing is put as \"Expand and UTF-8 string to UTF-16\"\nit sound a bit more clear.\n\n> doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text\n> is valid UTF-16 text.\n>\n> It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe\n> to use UTF-16.\n>\n> > > + * \\param[in] str String to convert\n> >\n> > The string to convert\n> >\n> > > + * \\param[in] le Little-endian (false for Big-endian)\n> >\n> > The desired byte-endianess of the converted byte array.\n> >\n> > An enum would not hurt, but it's not strictly required.\n> >\n> > > + *\n> > > + * \\return Byte array of UCS-2 representation of \\a str, without null-terminator\n> >\n> > While it is still not clear to me the distinction between UTF-16 and\n> > UCS-2 and I get the two are actually converging over time, the\n> > documentaion of std::mbrtoc16 explicitely mentions UTF-16.\n> >\n> > I guess it again depends on the encoding of \\a str (which again\n> > depends on the selected locale ?)\n> >\n> > > + */\n> > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)\n> >\n> > I wonder why we use snake_case in utils ? maybe to mimic STL ?\n>\n> For some functions, we reuse names defined elsewhere (for instance our\n> custom implementation of secure_getenv() for platforms that don't\n> provide it). For other functions, it's to mimick STL indeed (for\n> instance set_overlap, there's a std::set_union() and\n> std::set_intersection()). Then we have duration_to_timespec() and\n> time_point_to_string() that should probably be in camelCase.\n>\n> For this function, I'd use cameraCase. The function should likely be\n> named utf8ToUtf16 or something similar, to make it clear what the source\n> encoding is.\n>\n> If we want to keep this function a generator UTF-8 to UTF-16 convertor,\n> I'd return a std::u16string instead of a std::vector<uint8_t>. This\n> would mean handling the endianness conversion in the caller. The\n> alternative is to turn this into a UTF-8 to Exif-Unicode function, in\n> which case you can keep the le argument, but I would then rename it to\n> endianness and use the Exif endianness macros as values instead of\n> making it a bool.\n>\n> > > +{\n> > > +\tstd::mbstate_t state{};\n> > > +\tchar16_t c16;\n> > > +\tconst char *ptr = &str[0], *end = &str[0] + str.size();\n> >\n> > One variable per line and maybe\n> >         const char *end = &str.back()\n>\n> back() isn't the same.\n>\n\nUps\nstd::string::back() : equivalent to operator[](size() - 1)\n\n> \tconst char *ptr = str.data();\n> \tconst char *end = ptr + str.size();\n>\n> > > +\n> > > +\tstd::vector<uint8_t> ret;\n> >\n> > I would reserve str.size() * 2\n> >\n> > Even if I get it's not necessarly that every char in str gets expanded\n> > to two bytes\n> >\n> > > +\twhile (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {\n> >\n> > std::mbrtoc16 ?\n> > How come the compiler does not complain ?\n>\n> With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it\n> should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,\n> mbrtoc16() is fine, but the state variable above should be changed.\n\nUh, I assumed the C++ header having seen std::mbstate_t\n\n>\n> The C and C++ libraries are allowed to, but not required to, define\n> functions and types in both the std namespace and the unqualified\n> namespace. It's unsafe to rely on this though.\n>\n> > > +\t\tif (rc == static_cast<size_t>(-2) ||\n> > > +\t\t    rc == static_cast<size_t>(-1))\n> > > +\t\t\tbreak;\n> > > +\n> > > +\t\tret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));\n> > > +\t\tret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));\n> >\n> > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets\n> > automatically converted, does it ?\n> >\n> > > +\n> > > +\t\tif (rc > 0)\n> > > +\t\t\tptr += rc;\n> > > +\t}\n> > > +\n> > > +\treturn ret;\n> > > +}\n> > > +\n> > >  /**\n> > >   * \\fn std::vector<typename T::key_type> map_keys(const T &map)\n> > >   * \\brief Retrieve the keys of a std::map<>\n>\n> --\n> Regards,\n>\n> Laurent Pinchart","headers":{"Return-Path":"<libcamera-devel-bounces@lists.libcamera.org>","X-Original-To":"parsemail@patchwork.libcamera.org","Delivered-To":"parsemail@patchwork.libcamera.org","Received":["from lancelot.ideasonboard.com (lancelot.ideasonboard.com\n\t[92.243.16.209])\n\tby patchwork.libcamera.org (Postfix) with ESMTPS id A26EFC0F1C\n\tfor <parsemail@patchwork.libcamera.org>;\n\tMon, 18 Jan 2021 09:16:56 +0000 (UTC)","from lancelot.ideasonboard.com (localhost [IPv6:::1])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTP id F13C96810C;\n\tMon, 18 Jan 2021 10:16:55 +0100 (CET)","from relay1-d.mail.gandi.net (relay1-d.mail.gandi.net\n\t[217.70.183.193])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTPS id 1ABA460314\n\tfor <libcamera-devel@lists.libcamera.org>;\n\tMon, 18 Jan 2021 10:16:53 +0100 (CET)","from uno.localdomain (93-34-118-233.ip49.fastwebnet.it\n\t[93.34.118.233]) (Authenticated sender: jacopo@jmondi.org)\n\tby relay1-d.mail.gandi.net (Postfix) with ESMTPSA id 62FAC240005;\n\tMon, 18 Jan 2021 09:16:52 +0000 (UTC)"],"X-Originating-IP":"93.34.118.233","Date":"Mon, 18 Jan 2021 10:17:11 +0100","From":"Jacopo Mondi <jacopo@jmondi.org>","To":"Laurent Pinchart <laurent.pinchart@ideasonboard.com>","Message-ID":"<20210118091711.g5l2a3b2au7t2vn7@uno.localdomain>","References":"<20210114104035.302968-1-paul.elder@ideasonboard.com>\n\t<20210114104035.302968-2-paul.elder@ideasonboard.com>\n\t<20210115141953.cvla5ym32ypqhw2l@uno.localdomain>\n\t<YAHKbwPOfvDWmpH/@pendragon.ideasonboard.com>","MIME-Version":"1.0","Content-Disposition":"inline","In-Reply-To":"<YAHKbwPOfvDWmpH/@pendragon.ideasonboard.com>","Subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","X-BeenThere":"libcamera-devel@lists.libcamera.org","X-Mailman-Version":"2.1.29","Precedence":"list","List-Id":"<libcamera-devel.lists.libcamera.org>","List-Unsubscribe":"<https://lists.libcamera.org/options/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=unsubscribe>","List-Archive":"<https://lists.libcamera.org/pipermail/libcamera-devel/>","List-Post":"<mailto:libcamera-devel@lists.libcamera.org>","List-Help":"<mailto:libcamera-devel-request@lists.libcamera.org?subject=help>","List-Subscribe":"<https://lists.libcamera.org/listinfo/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=subscribe>","Cc":"libcamera-devel@lists.libcamera.org","Content-Type":"text/plain; charset=\"utf-8\"","Content-Transfer-Encoding":"base64","Errors-To":"libcamera-devel-bounces@lists.libcamera.org","Sender":"\"libcamera-devel\" <libcamera-devel-bounces@lists.libcamera.org>"}},{"id":14566,"web_url":"https://patchwork.libcamera.org/comment/14566/","msgid":"<YAVULJuZ4a/RWHYC@pendragon.ideasonboard.com>","date":"2021-01-18T09:26:04","subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","submitter":{"id":2,"url":"https://patchwork.libcamera.org/api/people/2/","name":"Laurent Pinchart","email":"laurent.pinchart@ideasonboard.com"},"content":"Hi Jacopo,\n\nOn Mon, Jan 18, 2021 at 10:17:11AM +0100, Jacopo Mondi wrote:\n> On Fri, Jan 15, 2021 at 07:01:35PM +0200, Laurent Pinchart wrote:\n> > On Fri, Jan 15, 2021 at 03:19:53PM +0100, Jacopo Mondi wrote:\n> > > Hi Paul,\n> > >\n> > >   I read a few things around, but character encoding seems a very\n> > >   complex subject, so I mostly have minor comments here\n> > >\n> > > On Thu, Jan 14, 2021 at 07:40:30PM +0900, Paul Elder wrote:\n> > > > GPSProcessingMethod and UserComment in EXIF tags can be in Unicode, but\n> > >\n> > > From what I've read, even referring to Unicode might be mis-leading as\n> > > it includes a number of different encodings. Do the EXIF specification\n> > > mention Unicode or any other more specific standard ?\n> >\n> > The Exif specification is fairly bad in that regard. It only says\n> > \"unicode\", while what it means (or at least what is used in practice\n> > today, I don't know if that's what was meant originally) is UCS-2. The\n> > endianness is also not mentioned, and we believe it should match the\n> > endianness of the TIFF container, but that's not documented.\n> >\n> > > > are recommended to be in UCS-2. Add a function in utils to help with\n> > > > this.\n> >\n> > Unicode is a standard that contains (among other things) a set of\n> > characters with their respective code points (a number that identifies\n> > the character) and a set of character encodings. Encodings define how\n> > to encode a code point into bytes. UCS-2 is one particular encoding.\n> > Saying \"can be in Unicode but are recommended to be in UCS-2\" is thus\n> > not quite correct.\n> >\n> > > > Signed-off-by: Paul Elder <paul.elder@ideasonboard.com>\n> > > > ---\n> > > >  include/libcamera/internal/utils.h |  2 ++\n> > > >  src/libcamera/utils.cpp            | 30 ++++++++++++++++++++++++++++++\n> > > >  2 files changed, 32 insertions(+)\n> > > >\n> > > > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h\n> > > > index f08134af..aa9cc236 100644\n> > > > --- a/include/libcamera/internal/utils.h\n> > > > +++ b/include/libcamera/internal/utils.h\n> > > > @@ -35,6 +35,8 @@ const char *basename(const char *path);\n> > > >  char *secure_getenv(const char *name);\n> > > >  std::string dirname(const std::string &path);\n> > > >\n> > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le);\n> > > > +\n> >\n> > I'd move this to the EXIF class in the HAL. There's hardly any chance\n> > that libcamera will need to convert a string to UCS-2.\n> >\n> > > >  template<typename T>\n> > > >  std::vector<typename T::key_type> map_keys(const T &map)\n> > > >  {\n> > > > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp\n> > > > index e90375ae..89cb0f73 100644\n> > > > --- a/src/libcamera/utils.cpp\n> > > > +++ b/src/libcamera/utils.cpp\n> > > > @@ -17,6 +17,7 @@\n> > > >  #include <string.h>\n> > > >  #include <sys/stat.h>\n> > > >  #include <sys/types.h>\n> > > > +#include <uchar.h>\n> > > >  #include <unistd.h>\n> > > >\n> > > >  /**\n> > > > @@ -122,6 +123,35 @@ std::string dirname(const std::string &path)\n> > > >  \treturn path.substr(0, pos + 1);\n> > > >  }\n> > > >\n> > > > +/**\n> > > > + * \\brief Convert string to byte array of UCS-2\n> > >\n> > > a string to a byte array of UCS-2 encoded code point\n> > >\n> > > But I wonder, the encoding used to represent the characters in the\n> > > string I assume depends on some locale, do they ?\n> >\n> > std::string has no concept of encoding. It's only an array of char. This\n> > function assumes that the input is encoded in UTF-8, and outputs a\n> > UTF-16 string. UCS-2 is a precursor to UTF-16, and as long as a string\n> \n> Not sure I get the \"concept of encoding\". Each char is an 8-bit value\n> which represents a code point. Different encodings might associate the code\n> point to a different symbol.\n\nIn unicode a code point is a 32-bit value. The encoding then defines how\nthe code point is encoded in bytes (in a variable number of bytes for\nUTF-8, a single byte that can only represent a subset of unicode for\nISO-8859-*, a variable number of u16 for UTF-16, a single u16 that can\nonly represent a subdev for unicode for UCS-2, ...). An std::string\ndoesn't specify any encoding. As it stores a string as a null-terminated\narray of char, any char-based encoding is usable (UTF-8, ISO-8859-*,\n...) but not \"wide\" char-based encodings (UTF-16, UCS-2, ...).\n\n> Just after the first statement you said \"UTF-8\" is assumed, so there's\n> some assumed encoding.\n> \n> Anyway, if the whole thing is put as \"Expand and UTF-8 string to UTF-16\"\n> it sound a bit more clear.\n> \n> > doesn't contain any code point in the range U+D800–U+DFFF, a UCS-2 text\n> > is valid UTF-16 text.\n> >\n> > It's not clear whether Exif uses UCS-2 or UTF-16, and I think it's safe\n> > to use UTF-16.\n> >\n> > > > + * \\param[in] str String to convert\n> > >\n> > > The string to convert\n> > >\n> > > > + * \\param[in] le Little-endian (false for Big-endian)\n> > >\n> > > The desired byte-endianess of the converted byte array.\n> > >\n> > > An enum would not hurt, but it's not strictly required.\n> > >\n> > > > + *\n> > > > + * \\return Byte array of UCS-2 representation of \\a str, without null-terminator\n> > >\n> > > While it is still not clear to me the distinction between UTF-16 and\n> > > UCS-2 and I get the two are actually converging over time, the\n> > > documentaion of std::mbrtoc16 explicitely mentions UTF-16.\n> > >\n> > > I guess it again depends on the encoding of \\a str (which again\n> > > depends on the selected locale ?)\n> > >\n> > > > + */\n> > > > +std::vector<uint8_t> string_to_c16(const std::string &str, bool le)\n> > >\n> > > I wonder why we use snake_case in utils ? maybe to mimic STL ?\n> >\n> > For some functions, we reuse names defined elsewhere (for instance our\n> > custom implementation of secure_getenv() for platforms that don't\n> > provide it). For other functions, it's to mimick STL indeed (for\n> > instance set_overlap, there's a std::set_union() and\n> > std::set_intersection()). Then we have duration_to_timespec() and\n> > time_point_to_string() that should probably be in camelCase.\n> >\n> > For this function, I'd use cameraCase. The function should likely be\n> > named utf8ToUtf16 or something similar, to make it clear what the source\n> > encoding is.\n> >\n> > If we want to keep this function a generator UTF-8 to UTF-16 convertor,\n> > I'd return a std::u16string instead of a std::vector<uint8_t>. This\n> > would mean handling the endianness conversion in the caller. The\n> > alternative is to turn this into a UTF-8 to Exif-Unicode function, in\n> > which case you can keep the le argument, but I would then rename it to\n> > endianness and use the Exif endianness macros as values instead of\n> > making it a bool.\n> >\n> > > > +{\n> > > > +\tstd::mbstate_t state{};\n> > > > +\tchar16_t c16;\n> > > > +\tconst char *ptr = &str[0], *end = &str[0] + str.size();\n> > >\n> > > One variable per line and maybe\n> > >         const char *end = &str.back()\n> >\n> > back() isn't the same.\n> \n> Ups\n> std::string::back() : equivalent to operator[](size() - 1)\n> \n> > \tconst char *ptr = str.data();\n> > \tconst char *end = ptr + str.size();\n> >\n> > > > +\n> > > > +\tstd::vector<uint8_t> ret;\n> > >\n> > > I would reserve str.size() * 2\n> > >\n> > > Even if I get it's not necessarly that every char in str gets expanded\n> > > to two bytes\n> > >\n> > > > +\twhile (size_t rc = mbrtoc16(&c16, ptr, end - ptr + 1, &state)) {\n> > >\n> > > std::mbrtoc16 ?\n> > > How come the compiler does not complain ?\n> >\n> > With uchar.h, it should be mbrtoc16(), and mbstate_t. With cuchar it\n> > should be std::mbstate_t and std::mbrtoc16(). As we use the C headers,\n> > mbrtoc16() is fine, but the state variable above should be changed.\n> \n> Uh, I assumed the C++ header having seen std::mbstate_t\n> \n> > The C and C++ libraries are allowed to, but not required to, define\n> > functions and types in both the std namespace and the unqualified\n> > namespace. It's unsafe to rely on this though.\n> >\n> > > > +\t\tif (rc == static_cast<size_t>(-2) ||\n> > > > +\t\t    rc == static_cast<size_t>(-1))\n> > > > +\t\t\tbreak;\n> > > > +\n> > > > +\t\tret.push_back(le ? (c16 & 0xff) : ((c16 >> 8) & 0xff));\n> > > > +\t\tret.push_back(le ? ((c16 >> 8) & 0xff) : (c16 & 0xff));\n> > >\n> > > I think you can avoid & 0xff as being ret an array of uint8_t c16 gets\n> > > automatically converted, does it ?\n> > >\n> > > > +\n> > > > +\t\tif (rc > 0)\n> > > > +\t\t\tptr += rc;\n> > > > +\t}\n> > > > +\n> > > > +\treturn ret;\n> > > > +}\n> > > > +\n> > > >  /**\n> > > >   * \\fn std::vector<typename T::key_type> map_keys(const T &map)\n> > > >   * \\brief Retrieve the keys of a std::map<>","headers":{"Return-Path":"<libcamera-devel-bounces@lists.libcamera.org>","X-Original-To":"parsemail@patchwork.libcamera.org","Delivered-To":"parsemail@patchwork.libcamera.org","Received":["from lancelot.ideasonboard.com (lancelot.ideasonboard.com\n\t[92.243.16.209])\n\tby patchwork.libcamera.org (Postfix) with ESMTPS id AC829C0F1C\n\tfor <parsemail@patchwork.libcamera.org>;\n\tMon, 18 Jan 2021 09:26:23 +0000 (UTC)","from lancelot.ideasonboard.com (localhost [IPv6:::1])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTP id 3C7CC6810D;\n\tMon, 18 Jan 2021 10:26:23 +0100 (CET)","from perceval.ideasonboard.com (perceval.ideasonboard.com\n\t[IPv6:2001:4b98:dc2:55:216:3eff:fef7:d647])\n\tby lancelot.ideasonboard.com (Postfix) with ESMTPS id 4E20860314\n\tfor <libcamera-devel@lists.libcamera.org>;\n\tMon, 18 Jan 2021 10:26:21 +0100 (CET)","from pendragon.ideasonboard.com (62-78-145-57.bb.dnainternet.fi\n\t[62.78.145.57])\n\tby perceval.ideasonboard.com (Postfix) with ESMTPSA id 9C8472BB;\n\tMon, 18 Jan 2021 10:26:20 +0100 (CET)"],"Authentication-Results":"lancelot.ideasonboard.com;\n\tdkim=fail reason=\"signature verification failed\" (1024-bit key;\n\tunprotected) header.d=ideasonboard.com header.i=@ideasonboard.com\n\theader.b=\"KZYXCVg6\"; dkim-atps=neutral","DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/simple; d=ideasonboard.com;\n\ts=mail; t=1610961980;\n\tbh=e/LQVD7cxnT1SGcfPDoH+E/z364qnNGwuhsMzhcaR7A=;\n\th=Date:From:To:Cc:Subject:References:In-Reply-To:From;\n\tb=KZYXCVg6zibcHP138J+LdC1xcTWm1smZXsVY294vZAoCnDsgAby7lmgXtJEv/HRON\n\txs2SWDId8brvyRxZV8Z7pLbqDAe9ijd/RhQsn5rNsrJZ58W29uhT8UYHOxT2/0HHge\n\t1bPlyD9XfxLk+lIlWfk00usiSJZkHI2l2lKtusjA=","Date":"Mon, 18 Jan 2021 11:26:04 +0200","From":"Laurent Pinchart <laurent.pinchart@ideasonboard.com>","To":"Jacopo Mondi <jacopo@jmondi.org>","Message-ID":"<YAVULJuZ4a/RWHYC@pendragon.ideasonboard.com>","References":"<20210114104035.302968-1-paul.elder@ideasonboard.com>\n\t<20210114104035.302968-2-paul.elder@ideasonboard.com>\n\t<20210115141953.cvla5ym32ypqhw2l@uno.localdomain>\n\t<YAHKbwPOfvDWmpH/@pendragon.ideasonboard.com>\n\t<20210118091711.g5l2a3b2au7t2vn7@uno.localdomain>","MIME-Version":"1.0","Content-Disposition":"inline","In-Reply-To":"<20210118091711.g5l2a3b2au7t2vn7@uno.localdomain>","Subject":"Re: [libcamera-devel] [PATCH 1/6] utils: Add function to convert\n\tstring to UCS-2","X-BeenThere":"libcamera-devel@lists.libcamera.org","X-Mailman-Version":"2.1.29","Precedence":"list","List-Id":"<libcamera-devel.lists.libcamera.org>","List-Unsubscribe":"<https://lists.libcamera.org/options/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=unsubscribe>","List-Archive":"<https://lists.libcamera.org/pipermail/libcamera-devel/>","List-Post":"<mailto:libcamera-devel@lists.libcamera.org>","List-Help":"<mailto:libcamera-devel-request@lists.libcamera.org?subject=help>","List-Subscribe":"<https://lists.libcamera.org/listinfo/libcamera-devel>,\n\t<mailto:libcamera-devel-request@lists.libcamera.org?subject=subscribe>","Cc":"libcamera-devel@lists.libcamera.org","Content-Type":"text/plain; charset=\"utf-8\"","Content-Transfer-Encoding":"base64","Errors-To":"libcamera-devel-bounces@lists.libcamera.org","Sender":"\"libcamera-devel\" <libcamera-devel-bounces@lists.libcamera.org>"}}]