Hello! My name is Alexander and I work as a microcontroller developer.
When starting a new project at work, I habitually added the source files of all sorts of useful utilities to the project tree. And then I got stuck for a while on one header, app_debug.h.
We published and translated this article with the copyright holder's permission. The author is Alexander Sazhin (Nickname - Saalur [RU], email - a.y.sazhin@gmail.com). The article was originally published on Habr.
You see, last December, GNU Arm Embedded Toolchain released 10-2020-q4-major, which included all GCC 10.2 features, and hence supported Concepts, Ranges, Coroutines and other less prominent C++20 novelties.
Inspired by the new standard, I imagined my future C++ code as ultramodern, concise, and poetic. And the good old printf("Debug message\n") didn't really fit into this joyful plan.
I wanted the combination of uncompromising C++ functionality and the standard's usability!
float raw[] = {3.1416, 2.7183, 1.618};
array<int, 3> arr{123, 456, 789};
cout << int{2021} << '\n'
     << float{9.806} << '\n'
     << raw << '\n'
     << arr << '\n'
     << "Hello, Habr!" << '\n'
     << ("esreveR me!" | views::take(7) | views::reverse) << '\n';
Well, if you want something good, why deny yourself?
Let's implement a stream interface in C++20 for debug output on an MCU that would support any suitable protocol provided by the microcontroller's vendor. It should be lightweight and fast, without boilerplate code. Such a stream interface should also support both blocking character output for time-insensitive code sections and non-blocking output for fast functions.
Let's set several convenient aliases to make code comfortable to read:
using base_t = std::uint32_t;
using fast_t = std::uint_fast32_t;
using index_t = std::size_t;
As is known, non-blocking data transfer in microcontrollers is implemented with interrupts and DMA. To identify the output modes, let's create an enum:
enum class BusMode{
    BLOCKING,
    IT,
    DMA,
};
Let's describe a base class that implements the logic of the protocols that are responsible for debug output:
[SPOILER BLOCK BEGINS]
template<typename T>
class BusInterface{
public:
    using derived_ptr = T*;
    static constexpr BusMode mode = T::mode;
    void send (const char arr[], index_t num) noexcept {
        // Every branch is "if constexpr", so only the selected call is instantiated
        if constexpr (BusMode::BLOCKING == mode){
            derived()->send_block(arr, num);
        } else if constexpr (BusMode::IT == mode){
            derived()->send_it(arr, num);
        } else if constexpr (BusMode::DMA == mode){
            derived()->send_dma(arr, num);
        }
    }
private:
    derived_ptr derived(void) noexcept{
        return static_cast<derived_ptr>(this);
    }
    // Empty fallbacks for the modes a derived class does not define
    void send_block (const char arr[], const index_t num) noexcept {}
    void send_it (const char arr[], const index_t num) noexcept {}
    void send_dma (const char arr[], const index_t num) noexcept {}
};
[SPOILER BLOCK ENDS]
The class is implemented with the CRTP pattern, which gives us the advantages of compile-time polymorphism. The class contains a single public send() method. In this method, the necessary implementation is selected at compile time, depending on the output mode. As arguments, the method takes a pointer to the data buffer and the number of valid bytes in it. In my practice, this is the most common argument format in the HAL functions of MCU vendors.
And then, for example, the Uart class inherited from this base class will look something like this:
[SPOILER BLOCK BEGINS]
template<BusMode Mode>
class Uart final : public BusInterface<Uart<Mode>> {
private:
    static constexpr BusMode mode = Mode;
    void send_block (const char arr[], const index_t num) noexcept{
        HAL_UART_Transmit(
            &huart,
            bit_cast<std::uint8_t*>(arr),
            std::uint16_t(num),
            base_t{5000}
        );
    }
    void send_it (const char arr[], const index_t num) noexcept {
        HAL_UART_Transmit_IT(
            &huart,
            bit_cast<std::uint8_t*>(arr),
            std::uint16_t(num)
        );
    }
    void send_dma (const char arr[], const index_t num) noexcept {
        HAL_UART_Transmit_DMA(
            &huart,
            bit_cast<std::uint8_t*>(arr),
            std::uint16_t(num)
        );
    }
    friend class BusInterface<Uart<BusMode::BLOCKING>>;
    friend class BusInterface<Uart<BusMode::IT>>;
    friend class BusInterface<Uart<BusMode::DMA>>;
};
[SPOILER BLOCK ENDS]
By analogy, one can implement classes of other protocols supported by the microcontroller. Just replace the corresponding HAL functions in the send_block(), send_it() and send_dma() methods. If the data transfer protocol does not support all modes, then the corresponding method is simply not defined.
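To illustrate "by analogy", here is a minimal host-side sketch of a bus that collects bytes in a std::string instead of calling HAL functions. StringBus, its sink member, and string_bus_demo() are hypothetical names that are not part of the project. BusMode and BusInterface are repeated in a reduced form only so that the fragment compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class BusMode { BLOCKING, IT, DMA };

// Reduced copy of the CRTP base from the article.
template <typename T>
class BusInterface {
public:
    void send(const char arr[], std::size_t num) noexcept {
        // T is complete when send() is instantiated, so T::mode can be read here.
        if constexpr (BusMode::BLOCKING == T::mode) {
            derived()->send_block(arr, num);
        } else if constexpr (BusMode::IT == T::mode) {
            derived()->send_it(arr, num);
        } else {
            derived()->send_dma(arr, num);
        }
    }
private:
    T* derived() noexcept { return static_cast<T*>(this); }
    // Empty fallbacks for modes the derived class does not implement.
    void send_block(const char[], std::size_t) noexcept {}
    void send_it(const char[], std::size_t) noexcept {}
    void send_dma(const char[], std::size_t) noexcept {}
};

// Hypothetical host-side bus: appends bytes to a string instead of calling HAL.
class StringBus final : public BusInterface<StringBus> {
public:
    std::string sink;  // everything "transmitted" so far
private:
    static constexpr BusMode mode = BusMode::BLOCKING;
    void send_block(const char arr[], std::size_t num) noexcept {
        sink.append(arr, num);
    }
    // send_it() and send_dma() are intentionally left undefined.
    friend class BusInterface<StringBus>;
};

// Small usage example: two blocking sends end up in the sink in order.
inline std::string string_bus_demo() {
    StringBus bus;
    bus.send("Hello", 5);
    bus.send(", Habr!", 7);
    assert(bus.sink == "Hello, Habr!");
    return bus.sink;
}
```

A bus like this can be handy for unit-testing the stream logic on a PC before moving to the target hardware.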
And to conclude this part of the article, let's create short aliases of the final Uart class:
using UartBlocking = BusInterface<Uart<BusMode::BLOCKING>>;
using UartIt = BusInterface<Uart<BusMode::IT>>;
using UartDma = BusInterface<Uart<BusMode::DMA>>;
Great, now let's develop the output stream class:
[SPOILER BLOCK BEGINS]
template <class Bus, char Delim>
class StreamBase final: public StreamStorage
{
public:
    using bus_t = Bus;
    using stream_t = StreamBase<Bus, Delim>;
    static constexpr BusMode mode = bus_t::mode;
    StreamBase() = default;
    ~StreamBase(){ if constexpr (BusMode::BLOCKING != mode) flush(); }
    StreamBase(const StreamBase&) = delete;
    StreamBase& operator= (const StreamBase&) = delete;
    stream_t& operator << (const char_type auto c){
        if constexpr (BusMode::BLOCKING == mode){
            bus.send(&c, 1);
        } else {
            *it = c;
            it = std::next(it);
        }
        return *this;
    }
    stream_t& operator << (const std::floating_point auto f){
        if constexpr (BusMode::BLOCKING == mode){
            auto [ptr, cnt] = NumConvert::to_string_float(f, buffer.data());
            bus.send(ptr, cnt);
        } else {
            auto [ptr, cnt] = NumConvert::to_string_float(
                f, buffer.data() + std::distance(buffer.begin(), it));
            it = std::next(it, cnt);
        }
        return *this;
    }
    stream_t& operator << (const num_type auto n){
        auto [ptr, cnt] = NumConvert::to_string_integer( n, &buffer.back() );
        if constexpr (BusMode::BLOCKING == mode){
            bus.send(ptr, cnt);
        } else {
            auto src = std::prev(buffer.end(), cnt + 1);
            it = std::copy(src, buffer.end(), it);
        }
        return *this;
    }
    stream_t& operator << (const std::ranges::range auto& r){
        std::ranges::for_each(r, [this](const auto val) {
            // decltype(val) is const-qualified, so strip the qualifiers
            // before checking the concepts
            using value_t = std::remove_cvref_t<decltype(val)>;
            if constexpr (char_type<value_t>){
                *this << val;
            } else if constexpr (num_type<value_t>
                    || std::floating_point<value_t>){
                *this << val << Delim;
            }
        });
        return *this;
    }
private:
    void flush (void) {
        bus.send(buffer.data(),
                 std::distance(buffer.begin(), it));
        it = buffer.begin();
    }
    std::span<char> buffer{storage};
    std::span<char>::iterator it{buffer.begin()};
    bus_t bus;
};
[SPOILER BLOCK ENDS]
Let's take a closer look at its significant parts.
The class template is parameterized by the protocol class and by Delim, a value of type char. It inherits from the StreamStorage class, whose only task is to provide access to the char array in which output strings are formed in non-blocking mode. I am not giving its implementation here because it is not quite relevant to the topic at hand; you are welcome to check my example at the end of the article. For convenient and safe operation with this array (storage in the example), let's create two private class members:
std::span<char> buffer{storage};
std::span<char>::iterator it{buffer.begin()};
Delim is the delimiter placed between numeric values when the contents of arrays/containers are output.
The public methods of the class are four operator<< overloads. Three of them display the basic types that our interface will work with (char, float, and integral types). The fourth one displays the contents of arrays and standard containers.
And this is where the most exciting part begins.
Each output operator overload is a function template whose template parameter is constrained by the requirements of the specified concept. I use my own char_type and num_type concepts...
template <typename T>
concept char_type = std::same_as<T, char>;
template <typename T>
concept num_type = std::integral<T> && !char_type<T>;
... and concepts from the standard library - std::floating_point and std::ranges::range.
Basic type concepts protect us from ambiguous overloads, and in combination with the range concept allow us to implement a single output algorithm for any standard containers and arrays.
The logic inside each basic type output operator is simple. Depending on the output mode (blocking/non-blocking), we either immediately send the character to print, or we form a string in the stream buffer. When the enclosing function returns, the stream object is destroyed and its destructor calls the private flush() method, which sends the prepared string to print in IT or DMA mode.
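The difference between the two modes can be made visible on a PC with a bus that simply records each transfer it is asked to perform. This is only a sketch: RecordingBus, MiniStream, and the two helper functions are hypothetical names, and the buffer is a plain class member, which is acceptable here only because the recording bus copies the data immediately.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class BusMode { BLOCKING, IT, DMA };

// Hypothetical bus that records every transfer it is asked to perform.
struct RecordingBus {
    std::vector<std::string>& log;
    void send(const char* data, std::size_t n) { log.emplace_back(data, n); }
};

template <BusMode Mode>
class MiniStream {
public:
    explicit MiniStream(std::vector<std::string>& log) : bus{log} {}
    // Non-blocking modes send the collected string once, on destruction.
    ~MiniStream() { if constexpr (Mode != BusMode::BLOCKING) flush(); }
    MiniStream(const MiniStream&) = delete;
    MiniStream& operator=(const MiniStream&) = delete;

    MiniStream& operator<<(char c) {
        if constexpr (Mode == BusMode::BLOCKING) {
            bus.send(&c, 1);        // one transfer per character
        } else if (pos < buffer.size()) {
            buffer[pos++] = c;      // only collect; extra characters are dropped
        }
        return *this;
    }
private:
    void flush() { bus.send(buffer.data(), pos); pos = 0; }
    std::array<char, 64> buffer{};
    std::size_t pos = 0;
    RecordingBus bus;
};

// Blocking mode: every character is a separate transfer.
inline std::vector<std::string> transfers_blocking() {
    std::vector<std::string> log;
    { MiniStream<BusMode::BLOCKING> out{log}; out << 'O' << 'K'; }
    assert(log.size() == 2);
    return log;
}

// DMA mode: a single transfer happens when the stream goes out of scope.
inline std::vector<std::string> transfers_dma() {
    std::vector<std::string> log;
    { MiniStream<BusMode::DMA> out{log}; out << 'O' << 'K'; }
    assert(log.size() == 1 && log.front() == "OK");
    return log;
}
```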
When converting a numeric value to a char array, I gave up the well-known idiom with snprintf() in favor of neiver's [RU] solutions. In his publications, the author shows that the proposed number-to-string conversion algorithms are noticeably superior both in binary size and in conversion speed. I borrowed the code from him and encapsulated it in the NumConvert class, which contains the to_string_integer() and to_string_float() methods.
In the operator overload for arrays and containers, we use the standard std::ranges::for_each() algorithm to walk through the range contents. If an element satisfies the char_type concept, we output the characters one after another without separators. If an element satisfies the num_type or std::floating_point concept, we separate the values with the specified Delim character.
Well, we've made everything rather complicated with all these templates, concepts, and other "heavy" C++ stuff. So, are we going to get a wall of assembly at the output? Let's look at two examples:
int main() {
    using StreamUartBlocking = StreamBase<UartBlocking, ' '>;
    StreamUartBlocking cout;
    cout << 'A'; // 1
    cout << ("esreveR me!" | std::views::take(7) | std::views::reverse); // 2
    return 0;
}
Let's set the compiler flags: -std=gnu++20 -Os -fno-exceptions -fno-rtti. Then for the first example we get the following assembly listing:
main:
        push    {r3, lr}
        movs    r0, #65
        bl      putchar
        movs    r0, #0
        pop     {r3, pc}
And in the second example:
.LC0:
        .ascii  "esreveR me!\000"
main:
        push    {r3, r4, r5, lr}
        ldr     r5, .L4
        movs    r4, #5
.L3:
        subs    r4, r4, #1
        bcc     .L2
        ldrb    r0, [r5, r4]    @ zero_extendqisi2
        bl      putchar
        b       .L3
.L2:
        movs    r0, #0
        pop     {r3, r4, r5, pc}
.L4:
        .word   .LC0
I think the result is pretty good. We got the familiar C++ stream interface and convenient output of numeric values and containers/arrays. We also got ranges processing directly in the output expression. And we got all this with virtually zero overhead.
Of course, during numeric values output, another code will be added to convert the number into a string.
You can test it online here (for clarity, I replaced the hardware dependent code with putchar()).
You can check/borrow the working code of the project from here. An example from the beginning of the article is implemented there.
This is the initial variant of the code. Some improvements and tests are still required before it can be used with confidence. For example, we need a synchronization mechanism for non-blocking output: while the output started by one function has not yet completed, the next function may already be overwriting the buffer with new data. I also need to experiment carefully with the std::views algorithms. For example, when we apply std::views::drop() to a string literal or an array of chars, the "inconsistent directions for distance and bound" error is reported. Well, the standard is new, we will master it over time.
You can see how it works here. For the project, I used the dual-core STM32H745 microcontroller. On the first core (480MHz), the output goes in blocking mode through the SWO debugging interface, and the code from the example executes in 9.2 microseconds. On the second core (240MHz), the output goes through Uart in DMA mode and takes about 20 microseconds.
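One possible shape for such a synchronization mechanism is a busy flag that is set before a transfer starts and cleared when the hardware reports completion. The sketch below is only an assumption about how this could look, not code from the project: TxGuard, try_acquire(), release(), and guard_demo() are hypothetical names. On an STM32 the release() call would typically be placed in the transfer-complete callback of the vendor HAL.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical guard for a shared output buffer used by IT/DMA transfers.
class TxGuard {
public:
    // Call before touching the buffer; returns false while a transfer is in flight.
    bool try_acquire() noexcept {
        return !busy.test_and_set(std::memory_order_acquire);
    }
    // Call from the transfer-complete interrupt handler.
    void release() noexcept { busy.clear(std::memory_order_release); }
private:
    std::atomic_flag busy{};  // starts in the clear state (C++20)
};

// Usage example with a simulated completion interrupt.
inline bool guard_demo() {
    TxGuard guard;
    const bool first  = guard.try_acquire();  // buffer was free
    const bool second = guard.try_acquire();  // "transfer" still running
    guard.release();                          // simulated completion interrupt
    const bool third  = guard.try_acquire();  // buffer is free again
    assert(first && !second && third);
    return first && !second && third;
}
```

Whether a writer should wait, drop the message, or switch to a second buffer when try_acquire() fails is a separate design decision that depends on how time-critical the calling code is.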
Something like that.
Thank you for your attention. I would be happy to get feedback and comments, as well as ideas and examples of how I can improve this mess.