[libcamera-devel,v4,2/7] libcamera: utils: Add method to strip Unicode characters

Message ID 20200813223722.4050835-3-niklas.soderlund@ragnatech.se
State Superseded
Delegated to: Niklas Söderlund
Headers show
Series
  • libcamera: Allow for user-friendly names in applications
Related show

Commit Message

Niklas Söderlund Aug. 13, 2020, 10:37 p.m. UTC
Add method that strips non-ASCII characters from a string.

Signed-off-by: Niklas Söderlund <niklas.soderlund@ragnatech.se>
---
* Changes since v3
- Fix spelling in comment.
- Rename to toAscii()
---
 include/libcamera/internal/utils.h |  2 ++
 src/libcamera/utils.cpp            | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+)

Comments

Kieran Bingham Aug. 14, 2020, 8:28 a.m. UTC | #1
Hi Niklas,

On 13/08/2020 23:37, Niklas Söderlund wrote:
> Add method that strips non-ASCII characters from a string.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund@ragnatech.se>

I still would have thought the isprint might be more useful, as I don't
think we want chars like newlines/or console control characthers getting
through these strings.

But if you don't think that's an issue here:

Reviewed-by: Kieran Bingham <kieran.bingham@ideasonboard.com>

I don't want to block the series, and this is an internal helper so it
can always be updated if we hit any issue.



> ---
> * Changes since v3
> - Fix spelling in comment.
> - Rename to toAscii()
> ---
>  include/libcamera/internal/utils.h |  2 ++
>  src/libcamera/utils.cpp            | 21 +++++++++++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> index 45cd6f120c51586b..b27f5a2323552058 100644
> --- a/include/libcamera/internal/utils.h
> +++ b/include/libcamera/internal/utils.h
> @@ -197,6 +197,8 @@ private:
>  
>  details::StringSplitter split(const std::string &str, const std::string &delim);
>  
> +std::string toAscii(const std::string &str);
> +
>  std::string libcameraBuildPath();
>  std::string libcameraSourcePath();
>  
> diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> index 615df46ac142a2a9..726b84bfbae53ff2 100644
> --- a/src/libcamera/utils.cpp
> +++ b/src/libcamera/utils.cpp
> @@ -342,6 +342,27 @@ details::StringSplitter split(const std::string &str, const std::string &delim)
>  	return details::StringSplitter(str, delim);
>  }
>  
> +/**
> + * \brief Strip all Unicode characters from a string
> + * \param[in] str The string to strip
> + *
> + * Strip all non-ASCII characters form a string. A Unicode character that spans
> + * multiple bytes (and therefore is not also an ASCII character) may be
> + * identified by the fact that its most significant bit is always set.
> + *
> + * \todo When switching to C++ 20 use std::remove_if.
> + *
> + * \return An ASCII string
> + */
> +std::string toAscii(const std::string &str)
> +{
> +	std::string ret;
> +	for (const char &c : str)
> +		if (!(c & 0x80))
> +			ret += c;
> +	return ret;
> +}
> +
>  /**
>   * \brief Check if libcamera is installed or not
>   *
>
Niklas Söderlund Aug. 14, 2020, 10:42 a.m. UTC | #2
Hi Kieran,

On 2020-08-14 09:28:59 +0100, Kieran Bingham wrote:
> Hi Niklas,
> 
> On 13/08/2020 23:37, Niklas Söderlund wrote:
> > Add method that strips non-ASCII characters from a string.
> > 
> > Signed-off-by: Niklas Söderlund <niklas.soderlund@ragnatech.se>
> 
> I still would have thought the isprint might be more useful, as I don't
> think we want chars like newlines/or console control characthers getting
> through these strings.

I thought about it but if we used std::isprint() this would not produce 
an ASCII string, instead it would produce an printable string :-) I read 
some post (which I can't find now) that spoke agains using 
std::isprint() to drop non-ASCII charters, but it might have been 
targeted on proprietary OS implementations...

As you say, this is an internal helper and we can switch to 
std::isprint() later if need.

> 
> But if you don't think that's an issue here:
> 
> Reviewed-by: Kieran Bingham <kieran.bingham@ideasonboard.com>

Thanks!

> 
> I don't want to block the series, and this is an internal helper so it
> can always be updated if we hit any issue.
> 
> 
> 
> > ---
> > * Changes since v3
> > - Fix spelling in comment.
> > - Rename to toAscii()
> > ---
> >  include/libcamera/internal/utils.h |  2 ++
> >  src/libcamera/utils.cpp            | 21 +++++++++++++++++++++
> >  2 files changed, 23 insertions(+)
> > 
> > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > index 45cd6f120c51586b..b27f5a2323552058 100644
> > --- a/include/libcamera/internal/utils.h
> > +++ b/include/libcamera/internal/utils.h
> > @@ -197,6 +197,8 @@ private:
> >  
> >  details::StringSplitter split(const std::string &str, const std::string &delim);
> >  
> > +std::string toAscii(const std::string &str);
> > +
> >  std::string libcameraBuildPath();
> >  std::string libcameraSourcePath();
> >  
> > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > index 615df46ac142a2a9..726b84bfbae53ff2 100644
> > --- a/src/libcamera/utils.cpp
> > +++ b/src/libcamera/utils.cpp
> > @@ -342,6 +342,27 @@ details::StringSplitter split(const std::string &str, const std::string &delim)
> >  	return details::StringSplitter(str, delim);
> >  }
> >  
> > +/**
> > + * \brief Strip all Unicode characters from a string
> > + * \param[in] str The string to strip
> > + *
> > + * Strip all non-ASCII characters form a string. A Unicode character that spans
> > + * multiple bytes (and therefore is not also an ASCII character) may be
> > + * identified by the fact that its most significant bit is always set.
> > + *
> > + * \todo When switching to C++ 20 use std::remove_if.
> > + *
> > + * \return An ASCII string
> > + */
> > +std::string toAscii(const std::string &str)
> > +{
> > +	std::string ret;
> > +	for (const char &c : str)
> > +		if (!(c & 0x80))
> > +			ret += c;
> > +	return ret;
> > +}
> > +
> >  /**
> >   * \brief Check if libcamera is installed or not
> >   *
> > 
> 
> -- 
> Regards
> --
> Kieran
Laurent Pinchart Aug. 16, 2020, 11:58 a.m. UTC | #3
On Fri, Aug 14, 2020 at 12:37:17AM +0200, Niklas Söderlund wrote:
> Add method that strips non-ASCII characters from a string.
> 
> Signed-off-by: Niklas Söderlund <niklas.soderlund@ragnatech.se>
> ---
> * Changes since v3
> - Fix spelling in comment.
> - Rename to toAscii()
> ---
>  include/libcamera/internal/utils.h |  2 ++
>  src/libcamera/utils.cpp            | 21 +++++++++++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> index 45cd6f120c51586b..b27f5a2323552058 100644
> --- a/include/libcamera/internal/utils.h
> +++ b/include/libcamera/internal/utils.h
> @@ -197,6 +197,8 @@ private:
>  
>  details::StringSplitter split(const std::string &str, const std::string &delim);
>  
> +std::string toAscii(const std::string &str);
> +
>  std::string libcameraBuildPath();
>  std::string libcameraSourcePath();
>  
> diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> index 615df46ac142a2a9..726b84bfbae53ff2 100644
> --- a/src/libcamera/utils.cpp
> +++ b/src/libcamera/utils.cpp
> @@ -342,6 +342,27 @@ details::StringSplitter split(const std::string &str, const std::string &delim)
>  	return details::StringSplitter(str, delim);
>  }
>  
> +/**
> + * \brief Strip all Unicode characters from a string
> + * \param[in] str The string to strip
> + *
> + * Strip all non-ASCII characters form a string. A Unicode character that spans
> + * multiple bytes (and therefore is not also an ASCII character) may be

"Unicode character" refers to the code points, while "spans multiply
bytes" refers to encodings. As std::string has no notion of encoding,
this documentation should tell what the expected input encoding is.

> + * identified by the fact that its most significant bit is always set.

Open question, do we want to remove non-ASCII characters, or replaced
them (with a ".", "?" or something else) ?

> + *
> + * \todo When switching to C++ 20 use std::remove_if.

That would be less efficient :-)

> + *
> + * \return An ASCII string
> + */
> +std::string toAscii(const std::string &str)

toAscii() makes it sound the function converts the string, while it
really removes characters.

> +{
> +	std::string ret;

	ret.reserve(str.size());

> +	for (const char &c : str)
> +		if (!(c & 0x80))
> +			ret += c;
> +	return ret;
> +}
> +
>  /**
>   * \brief Check if libcamera is installed or not
>   *
Niklas Söderlund Aug. 16, 2020, 12:29 p.m. UTC | #4
Hi Laurent,

Thanks for your feedback.

On 2020-08-16 14:58:13 +0300, Laurent Pinchart wrote:
> On Fri, Aug 14, 2020 at 12:37:17AM +0200, Niklas Söderlund wrote:
> > Add method that strips non-ASCII characters from a string.
> > 
> > Signed-off-by: Niklas Söderlund <niklas.soderlund@ragnatech.se>
> > ---
> > * Changes since v3
> > - Fix spelling in comment.
> > - Rename to toAscii()
> > ---
> >  include/libcamera/internal/utils.h |  2 ++
> >  src/libcamera/utils.cpp            | 21 +++++++++++++++++++++
> >  2 files changed, 23 insertions(+)
> > 
> > diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
> > index 45cd6f120c51586b..b27f5a2323552058 100644
> > --- a/include/libcamera/internal/utils.h
> > +++ b/include/libcamera/internal/utils.h
> > @@ -197,6 +197,8 @@ private:
> >  
> >  details::StringSplitter split(const std::string &str, const std::string &delim);
> >  
> > +std::string toAscii(const std::string &str);
> > +
> >  std::string libcameraBuildPath();
> >  std::string libcameraSourcePath();
> >  
> > diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
> > index 615df46ac142a2a9..726b84bfbae53ff2 100644
> > --- a/src/libcamera/utils.cpp
> > +++ b/src/libcamera/utils.cpp
> > @@ -342,6 +342,27 @@ details::StringSplitter split(const std::string &str, const std::string &delim)
> >  	return details::StringSplitter(str, delim);
> >  }
> >  
> > +/**
> > + * \brief Strip all Unicode characters from a string
> > + * \param[in] str The string to strip
> > + *
> > + * Strip all non-ASCII characters form a string. A Unicode character that spans
> > + * multiple bytes (and therefore is not also an ASCII character) may be
> 
> "Unicode character" refers to the code points, while "spans multiply
> bytes" refers to encodings. As std::string has no notion of encoding,
> this documentation should tell what the expected input encoding is.
> 
> > + * identified by the fact that its most significant bit is always set.
> 
> Open question, do we want to remove non-ASCII characters, or replaced
> them (with a ".", "?" or something else) ?

I have not strong opinion, dropping them feels more natural to me but 
I'm open to substitution. What do other think?

> 
> > + *
> > + * \todo When switching to C++ 20 use std::remove_if.
> 
> That would be less efficient :-)
> 
> > + *
> > + * \return An ASCII string
> > + */
> > +std::string toAscii(const std::string &str)
> 
> toAscii() makes it sound the function converts the string, while it
> really removes characters.

I originally called this stripUnicode() would that name work better for 
you?

> 
> > +{
> > +	std::string ret;
> 
> 	ret.reserve(str.size());
> 
> > +	for (const char &c : str)
> > +		if (!(c & 0x80))
> > +			ret += c;
> > +	return ret;
> > +}
> > +
> >  /**
> >   * \brief Check if libcamera is installed or not
> >   *
> 
> -- 
> Regards,
> 
> Laurent Pinchart

Patch

diff --git a/include/libcamera/internal/utils.h b/include/libcamera/internal/utils.h
index 45cd6f120c51586b..b27f5a2323552058 100644
--- a/include/libcamera/internal/utils.h
+++ b/include/libcamera/internal/utils.h
@@ -197,6 +197,8 @@  private:
 
 details::StringSplitter split(const std::string &str, const std::string &delim);
 
+std::string toAscii(const std::string &str);
+
 std::string libcameraBuildPath();
 std::string libcameraSourcePath();
 
diff --git a/src/libcamera/utils.cpp b/src/libcamera/utils.cpp
index 615df46ac142a2a9..726b84bfbae53ff2 100644
--- a/src/libcamera/utils.cpp
+++ b/src/libcamera/utils.cpp
@@ -342,6 +342,27 @@  details::StringSplitter split(const std::string &str, const std::string &delim)
 	return details::StringSplitter(str, delim);
 }
 
+/**
+ * \brief Strip all Unicode characters from a string
+ * \param[in] str The string to strip
+ *
+ * Strip all non-ASCII characters form a string. A Unicode character that spans
+ * multiple bytes (and therefore is not also an ASCII character) may be
+ * identified by the fact that its most significant bit is always set.
+ *
+ * \todo When switching to C++ 20 use std::remove_if.
+ *
+ * \return An ASCII string
+ */
+std::string toAscii(const std::string &str)
+{
+	std::string ret;
+	for (const char &c : str)
+		if (!(c & 0x80))
+			ret += c;
+	return ret;
+}
+
 /**
  * \brief Check if libcamera is installed or not
  *