[RFC,1/8] Documentation: coding-style: Document usage of classes for strings
diff mbox series

Message ID 20241215230206.11002-2-laurent.pinchart@ideasonboard.com
State New
Headers show
Series
  • libcamera: Use std::string_view
Related show

Commit Message

Laurent Pinchart Dec. 15, 2024, 11:01 p.m. UTC
C++ has three main ways to handle strings: std::string, std::string_view
and C-style strings. Document their pros and cons and the rules
governing their usage in libcamera.

Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
---
 Documentation/coding-style.rst | 98 ++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

Comments

Barnabás Pőcze Dec. 16, 2024, 12:32 a.m. UTC | #1
Hi


2024. december 16., hétfő 0:01 keltezéssel, Laurent Pinchart <laurent.pinchart@ideasonboard.com> írta:

> C++ has three main ways to handle strings: std::string, std::string_view
> and C-style strings. Document their pros and cons and the rules
> governing their usage in libcamera.
> 
> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
> ---
>  Documentation/coding-style.rst | 98 ++++++++++++++++++++++++++++++++++
>  1 file changed, 98 insertions(+)
> 
> diff --git a/Documentation/coding-style.rst b/Documentation/coding-style.rst
> index 6ac3a4a0d5175118..dc4b093b58a99006 100644
> --- a/Documentation/coding-style.rst
> +++ b/Documentation/coding-style.rst
> @@ -269,6 +269,104 @@ the compiler select the right function. This avoids potential errors such as
>  calling abs(int) with a float argument, performing an unwanted implicit integer
>  conversion. For this reason, cmath is preferred over math.h.
> 
> +Strings
> +~~~~~~~
> +
> +This section focusses on strings as a sequence of characters represented by the

focuses


> +`char` type, as those are the only strings used in libcamera. The principles
> +are however equally applicable to other types of strings.
> +
> +C++ includes multiple standard ways to represent and handle strings. Each of
> +them have pros and cons and different usage patterns.
> +
> +1. C-style strings
> +
> +   C-style strings are null-terminated arrays of characters of type `char`.
> +   They are represented by a `char *` pointing to the first character. C string
> +   literals (`"Hello world!"`) produce read-only global null-terminated arrays
> +   of type `const char`.
> +
> +   Handling strings through `char *` relies on assumptions that are not
> +   enforced at compile time: the memory must not be freed as long as the
> +   pointer remains valid, and the string must be null-terminated. This causes a
> +   risk of use-after-free or buffer out-of-bounds reads. Furthermore, as the
> +   size of the underlying memory is not bundled with the pointer, handling
> +   writable strings as a `char *` risks buffer out-of-bounds writes.
> +
> +2. `std::string` class
> +
> +   The `std::string` class is the main C++ data type to represent strings. The
> +   class holds the string data as a null-terminated array of characters, as
> +   well as the length of the array. `std::string` literals (`"Hello world!"s`)
> +   converts the character array literal into a temporary `std::string` instance
> +   whose lifetime ends with the statement.
> +
> +   Usage of `std::string` makes string handling safer and convenient thanks to
> +   the integration with the rest of the C++ standard library. This comes at a
> +   cost, as making copies of the class copies the underlying data, requiring
> +   dynamic allocation of memory. This is partially offset by usage of move
> +   constructors or assignment operators.

I suppose SSO could be mentioned as well.


> +
> +3. `std::string_view` class
> +
> +   This newer addition to the C++ standard library is a non-owning reference to
> +   a characters array. Like the `std::string` class, the `std::string_view`
> +   stores a pointer to the array and a length, but unlike C-style strings and
> +   the `std::string` class, the array is not guaranteed to be null-terminated.
> +   `std::string_view` literals (`"Hello world!"sv`) create a temporary
> +   `std::string_view` instance whose lifetime ends with the statement, but the
> +   underlying character array has global static storage and lifetime.
> +
> +   String views are useful to represent in a single way strings with
> +   heterogenous underlying storage, as they can be constructed from a `char *`
> +   or a `std::string`. They reduce the risk of buffer out-of-bounds accesses as
> +   they carry the array length, and they speed up handling of substrings as
> +   they don't cause copies, unlike `std::string`. As the string views store a
> +   borrowed reference to the underlying storage, they are prone to
> +   use-after-free issues. The lack of a null terminator furthermore makes them
> +   unsafe to handle strings that need to be passed to C functions that assume
> +   null-terminated strings (such as most of the C standard library functions).
> +
> +   C++17 lacks many functions that make `std::string_view` usage convenient.
> +   This includes `operator+()` to concatenate a `std::string` and a
> +   `std::string_view`, as well as heterogenous lookup functions for containers
> +   (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2363r5.html).
> +   libcamera works around the former by defining the missing operators, and the
> +   latter by using the `find()` function instead of `at()` or `operator[]()`.
> +   Those issues are properly addressed in C++26.

Indeed, this is a very unfortunate thing, at least `std::map::find()` supports it since C++14.


> +
> +With those pros, cons are caveats in mind, libcamera follows a set of rules
> +that governs selection of appropriate string types for function arguments:
> +
> +* If the function is internal to a compilation unit, and all callers use the
> +  same string type for the function argument, use that type (by reference or
> +  value, as appropriate).

This seems not very easy to enforce, no? Or rather, it seems hard to enforce
it in the presence of code changes. I would think it more likely that people
will adapt to the existing interface and not spend much time considering this
aspect (rightfully so).


> +* If the function needs to modify the string, make a copy, or convert it to a
> +  `std::string` for any other reason, pass a `std::string` by value. Use
> +  `std::move()` in the caller if the string is not needed after the function
> +  call, as well as inside the function to move the string to a local copy if
> +  needed. This minimizes the number of copies in all use cases.

I have recently noticed that the `modernize-pass-by-value` clang-tidy check
can diagnose some occurrences of this (not just strings, naturally).
(https://clang.llvm.org/extra/clang-tidy/checks/modernize/pass-by-value.html)


> +* If the function needs to pass the string to a C function that takes a
> +  `const char *`, and the caller may reasonably use a C string literal, pass a
> +  `const char *`. This avoids copies when the caller uses a `char *` pointer or
> +  a C string literal, and avoids buffer out-of-bound reads that could occur with
> +  a non null-terminated `std::string_view`.
> +* If the function only needs to pass the string to other functions that take a
> +  `const std::string &` reference, pass a `const std::string &`. This minimizes
> +  the construction of `std::string` instances when the caller already holds an
> +  instance.
> +* If the function only reads from the string, doesn't rely on it being
> +  null-terminated, and can reasonably be called with either C strings, C string
> +  literals, or `std::string` instances, pass a `std::string_view` instance. This
> +  improves safety of the code without impacting performance.
> +* Do not use `std::string_view` in public API function parameters. Previous
> +  rules showed that selection of the optimal string type for function
> +  parameters depends on the internal implementation of the function, which
> +  could change over time. This conflicts with the stability requirements of the
> +  libcamera public API. Furthermore, as C++17 makes `std::string_view` usage
> +  inconvenient in many cases, we don't want to force that inconvenience upon
> +  libcamera users.

It is unfortunate how much null terminated strings permeate everything...

First, as for inconvenience, both `const char *` and `std::string` are implicitly
convertible to `std::string_view`, so I don't foresee many issues in the context
of simple function calls (e.g. `CameraManager::get()` class of functions). If
the libcamera API never returns `std::string_view`, then the user only needs to
interact with it minimally (and those cases are likely taken care of by implicit
conversions).

Second, if we ignore the existence of  NUL terminated strings and APIs using it for
a moment, I think it is clear that `std::string_view` wins over `const std::string&`
in the vast majority of cases for two main reasons (explained above already in detail)
(assuming that the implementation would work with both, of course):

  * lightweight, can usually be passed in registers
  * does not require the caller to have a `const std::string&` already

Third, going back to reality where NUL terminated strings still exist, considering
the choice between `const std::string&` and `std::string_view`, the most problematic
internal implementation change is that the function starts needing NUL terminated
strings. This is undoubtedly a concern, but I believe that in many cases
it is possible to estimate the likelihood and make a decision based on that.

For example, in the case of `CameraManager::get()` I would consider the likelihood
of such future change to be negligible, and thus the usage of `std::string_view`
acceptable. So I would like to think such a decision could be made on a case-by-case basis.

And even if such a change happens, in the worst case scenario, a `const char *` and/or
a `const std::string&` overload can be added while preserving API and ABI compatibility.

All in all, I tend to agree with most points written here, but I am not sure a
blanket ban on `std::string_view` in the public API is the optimal approach.


Regards,
Barnabás Pőcze


> +
> 
>  Documentation
>  -------------
> --
> Regards,
> 
> Laurent Pinchart
>

Patch
diff mbox series

diff --git a/Documentation/coding-style.rst b/Documentation/coding-style.rst
index 6ac3a4a0d5175118..dc4b093b58a99006 100644
--- a/Documentation/coding-style.rst
+++ b/Documentation/coding-style.rst
@@ -269,6 +269,104 @@  the compiler select the right function. This avoids potential errors such as
 calling abs(int) with a float argument, performing an unwanted implicit integer
 conversion. For this reason, cmath is preferred over math.h.
 
+Strings
+~~~~~~~
+
+This section focusses on strings as a sequence of characters represented by the
+`char` type, as those are the only strings used in libcamera. The principles
+are however equally applicable to other types of strings.
+
+C++ includes multiple standard ways to represent and handle strings. Each of
+them have pros and cons and different usage patterns.
+
+1. C-style strings
+
+   C-style strings are null-terminated arrays of characters of type `char`.
+   They are represented by a `char *` pointing to the first character. C string
+   literals (`"Hello world!"`) produce read-only global null-terminated arrays
+   of type `const char`.
+
+   Handling strings through `char *` relies on assumptions that are not
+   enforced at compile time: the memory must not be freed as long as the
+   pointer remains valid, and the string must be null-terminated. This causes a
+   risk of use-after-free or buffer out-of-bounds reads. Furthermore, as the
+   size of the underlying memory is not bundled with the pointer, handling
+   writable strings as a `char *` risks buffer out-of-bounds writes.
+
+2. `std::string` class
+
+   The `std::string` class is the main C++ data type to represent strings. The
+   class holds the string data as a null-terminated array of characters, as
+   well as the length of the array. `std::string` literals (`"Hello world!"s`)
+   converts the character array literal into a temporary `std::string` instance
+   whose lifetime ends with the statement.
+
+   Usage of `std::string` makes string handling safer and convenient thanks to
+   the integration with the rest of the C++ standard library. This comes at a
+   cost, as making copies of the class copies the underlying data, requiring
+   dynamic allocation of memory. This is partially offset by usage of move
+   constructors or assignment operators.
+
+3. `std::string_view` class
+
+   This newer addition to the C++ standard library is a non-owning reference to
+   a characters array. Like the `std::string` class, the `std::string_view`
+   stores a pointer to the array and a length, but unlike C-style strings and
+   the `std::string` class, the array is not guaranteed to be null-terminated.
+   `std::string_view` literals (`"Hello world!"sv`) create a temporary
+   `std::string_view` instance whose lifetime ends with the statement, but the
+   underlying character array has global static storage and lifetime.
+
+   String views are useful to represent in a single way strings with
+   heterogenous underlying storage, as they can be constructed from a `char *`
+   or a `std::string`. They reduce the risk of buffer out-of-bounds accesses as
+   they carry the array length, and they speed up handling of substrings as
+   they don't cause copies, unlike `std::string`. As the string views store a
+   borrowed reference to the underlying storage, they are prone to
+   use-after-free issues. The lack of a null terminator furthermore makes them
+   unsafe to handle strings that need to be passed to C functions that assume
+   null-terminated strings (such as most of the C standard library functions).
+
+   C++17 lacks many functions that make `std::string_view` usage convenient.
+   This includes `operator+()` to concatenate a `std::string` and a
+   `std::string_view`, as well as heterogenous lookup functions for containers
+   (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2363r5.html).
+   libcamera works around the former by defining the missing operators, and the
+   latter by using the `find()` function instead of `at()` or `operator[]()`.
+   Those issues are properly addressed in C++26.
+
+With those pros, cons are caveats in mind, libcamera follows a set of rules
+that governs selection of appropriate string types for function arguments:
+
+* If the function is internal to a compilation unit, and all callers use the
+  same string type for the function argument, use that type (by reference or
+  value, as appropriate).
+* If the function needs to modify the string, make a copy, or convert it to a
+  `std::string` for any other reason, pass a `std::string` by value. Use
+  `std::move()` in the caller if the string is not needed after the function
+  call, as well as inside the function to move the string to a local copy if
+  needed. This minimizes the number of copies in all use cases.
+* If the function needs to pass the string to a C function that takes a
+  `const char *`, and the caller may reasonably use a C string literal, pass a
+  `const char *`. This avoids copies when the caller uses a `char *` pointer or
+  a C string literal, and avoids buffer out-of-bound reads that could occur with
+  a non null-terminated `std::string_view`.
+* If the function only needs to pass the string to other functions that take a
+  `const std::string &` reference, pass a `const std::string &`. This minimizes
+  the construction of `std::string` instances when the caller already holds an
+  instance.
+* If the function only reads from the string, doesn't rely on it being
+  null-terminated, and can reasonably be called with either C strings, C string
+  literals, or `std::string` instances, pass a `std::string_view` instance. This
+  improves safety of the code without impacting performance.
+* Do not use `std::string_view` in public API function parameters. Previous
+  rules showed that selection of the optimal string type for function
+  parameters depends on the internal implementation of the function, which
+  could change over time. This conflicts with the stability requirements of the
+  libcamera public API. Furthermore, as C++17 makes `std::string_view` usage
+  inconvenient in many cases, we don't want to force that inconvenience upon
+  libcamera users.
+
 
 Documentation
 -------------