Embedded Programming for IoT
John Light, Intel OTC
OpenIoT Summit, San Diego, 6 April 2016

Embedded programming is a lost art

- Fewer and fewer engineers understand embedded programming: embedded software engineers are retiring or otherwise moving on, and most new engineers assume malloc, threads, and powerful CPUs.
- This decline in numbers matched the declining need, until IoT appeared.
- IoT leaf nodes are proliferating radically, and many are very small devices. Small devices cost less and consume less power.
- Many leaf nodes can't run Linux and use an RTOS instead.
- With so few resources, standard application methods shouldn't be used.

Embedded programming is critical to IoT

Most IoT devices need careful embedded programming practices:
- Leaf nodes will proliferate in the millions, so they must be really inexpensive.
- Many leaf nodes will be highly power-constrained, some running on scavenged power, so they will have highly constrained resources.
- Many leaf nodes will need to run for months and years (decades?) without maintenance or replacement.
- Gateways will need to meet strict uptime requirements.
- Many gateways will be resource-constrained because of size, power, and cost constraints.

What I'm going to talk about

- Why resources will remain constrained: cost, power, peripherals.
- General techniques of embedded programming: memory handling and coroutines.
- Myths about embedded memory management.

What I'm not going to talk about:
- RTOS choices and development environments.

Example processor for IoT

Intel Quark SE (the processor in the Intel Curie button SOC):
- 32 MHz processor (1992 i486 equivalent)
- 192 Kbytes code memory (late-80s PC equivalent)
- 80 Kbytes RAM (late-80s PC equivalent)
- Modern peripherals:
  - Extremely low-power sleep modes
  - Sensor hub that runs independently of the processor
  - Accelerometer with a 128-neuron network for pattern matching
  - ADC with 19 analog comparators
  - Battery-charging circuit (in Curie)
  - Bluetooth LE (in Curie)

Myth: IoT devices will use the latest technology

- IoT devices are extremely price-sensitive. (Because there are so many of them!)
- They are invariably designed for trailing-edge processes to reduce cost: older process nodes (65nm vs. 22nm) are much cheaper.
- Many ARM IoT chips are 1-5 process nodes behind; Intel chips are 2-3 process nodes behind.
- This limits performance and other resources.

Myth: Moore's Law will solve the resource shortage

Process improvements (e.g., 22nm to 14nm) bring architectural tradeoffs. A new semiconductor node can be used several ways for an SOC:
a. Make the die smaller to reduce cost. (Limitation: external interconnect.)
b. Leave the die the same size and put more computing resources on it.
c. Leave the die the same size and add/improve peripheral features.
d. Some combination of the above.

I believe new processes will be used for two primary purposes:
a. Reduce cost. The huge numbers of IoT devices will make them extremely price-sensitive.
b. Add/improve features. The IoT has a voracious appetite for new networking and specialized processing.
Computing resources may grow, but not much.

Memory shortage

- Memory is always short in embedded systems. (See the previous slides.)
- Even if there's enough of one kind of memory, another will be short. Which kind runs out first depends on the application, and this leads to trading one off against another.
- Normal heap usage (malloc/free) leads to fragmentation, and fragmentation is more likely as memory becomes more constrained.
- In machines with virtual memory, the effect is increased memory usage. In embedded systems, the effect is erratic behavior or system termination.
- In embedded programming, memory usage design is critical. An RTOS will support good memory design, but it won't hide the need for it.

Heap failure

- Heap failure is the inability of the heap to supply a required allocation in spite of having enough memory.
- Heap failure results from leaks (easily fixable) and fragmentation (not so much).
- The result of heap failure: the program stops working, whether you test the return code from malloc() or not.
- It often goes unreported because the report path or recovery code also suffers heap failure. A common story: "It mysteriously stopped running, but worked fine after a restart."
- Most heap failures occur over time, and IoT nodes are often long-lived.
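A minimal fragment to illustrate the last two points (log_error() is a hypothetical reporting call): checking the return code detects the failure, but it cannot prevent it, and the report path can fail the same way.

    void *p = malloc(600);
    if (p == NULL) {
        /* By the time this branch runs, the heap is already fragmented or
           exhausted; log_error() may itself need heap memory and fail,
           which is why heap failures so often go unreported. */
        log_error("out of memory");
    }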

Fragmentation

[Diagram: a 1024-byte heap containing a 500-byte free block, a 13-byte allocated block, and a 511-byte free block.]

- In this example, the heap is unable to provide 600 bytes even though 1011 bytes are free.
- This may seem contrived, but it happens all the time.
- Fragmentation is hidden by virtual memory, resulting in heap bloat.
- Malloc algorithms help a little bit, but only for specific usage patterns.
- Fragmentation occurs over time, often as a result of unexpected behavior.
- Testing is useful, but no amount of testing ensures it won't happen.
- Checking for malloc failure doesn't keep it from happening, and when one malloc failure is found, others are likely to happen right away.
- Reporting a malloc failure is hard because the reporting path may fail as well.
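A self-contained sketch of the mechanism (sizes and counts are illustrative): allocate many small blocks, free every other one, then request a block larger than any single hole. On a small embedded heap without virtual memory, the large request can fail even though half the memory is free; on a desktop OS, virtual memory typically hides the same pattern as heap bloat instead.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        enum { N = 64, SMALL = 512, LARGE = 16 * 1024 };
        void *blocks[N];

        for (int i = 0; i < N; i++)
            blocks[i] = malloc(SMALL);

        /* Free every other block: half the memory is free again, but only
           as SMALL-sized holes separated by still-allocated blocks. */
        for (int i = 0; i < N; i += 2)
            free(blocks[i]);

        /* On a constrained heap, no contiguous hole is LARGE bytes, so
           this can fail despite (N/2) * SMALL bytes being free. */
        void *big = malloc(LARGE);
        printf("large allocation %s\n", big ? "succeeded" : "failed");
        return 0;
    }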

What are the symptoms of heap failure?

Most programmers outside embedded programming have never seen and recognized a heap failure:
- Modern computers have amazing amounts of memory.
- Most applications don't run very long.
- We're all inured to random failures that clear up after a restart.
- Heap failure reporting is often non-existent or weak.
- Many applications are built with sophisticated memory handling.

Myth: memory fragmentation results from leaks

- Memory fragmentation can occur even when all memory leaks have been eliminated.
- Fragmentation results from emergent behavior in an allocation algorithm relative to a specific usage pattern.
- Not all usage patterns result in fragmentation. Most don't. But it's very hard to predict when fragmentation will occur.
- It's better to assume fragmentation will occur and avoid malloc/free.
- Of course, handling leaks is still critical on embedded systems. Eliminate them!

Myth: fragmentation results from calling malloc()

- All fragmentation results from calling free(). If free() is never called, there will be no fragmentation.
- This means you can use the available malloc() calls to populate buffer pools. Just never free those pool buffers, and you'll be fine.
- This also gives you a tool for analyzing existing software: look for the free() calls.
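A minimal sketch of that pattern, with assumed names (pool_init, pool_get, pool_put) and illustrative sizes: malloc() runs only at startup to fill the pool, and buffers are recycled through the pool rather than freed.

    #include <stdlib.h>

    #define POOL_COUNT 8
    #define BUF_SIZE   256

    static void *pool[POOL_COUNT];
    static int   pool_top;

    void pool_init(void) {
        /* The only malloc() calls in the program, run once at startup. */
        for (int i = 0; i < POOL_COUNT; i++)
            pool[pool_top++] = malloc(BUF_SIZE);
    }

    void *pool_get(void) {           /* take a buffer, or NULL if the pool is empty */
        return pool_top ? pool[--pool_top] : NULL;
    }

    void pool_put(void *buf) {       /* return to the pool; free() is never called */
        pool[pool_top++] = buf;
    }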

Myth: memory management hardware helps

Since fragmentation occurs in highly constrained systems, and highly constrained systems often lack memory management hardware, it is easy to assume:
1. The lack of memory management hardware is the source of the problem, and
2. Upgrading to an MMU will solve the problem.
Neither is true: fragmentation occurs with or without memory management hardware. Eliminating this magical thinking will make it easier to solve the real problem.

Bonus Myth: VM doesn't help, either

- This myth doesn't usually apply to embedded programming, since few IoT leaf nodes will have virtual memory, but it does apply to IoT gateways. (As do the others.)
- To be pedantic: for many application usage patterns that lead to heap failure through fragmentation, adding more memory can produce a system that doesn't experience heap failure. But that is due to the extra memory itself; it has nothing to do with whether virtual memory is used to provide it.

The limitations of testing

- Testing IoT programs is necessary, but it is not sufficient to ensure that they won't suffer memory failures under normal heap management.
- Memory failures typically result from specific usage patterns, and the great variety of usage patterns is not predictable. If a failure-inducing usage pattern is not applied during testing, that failure will not be seen during testing.
- Some memory failures result from long-term usage; there's no way to know how long that is, and it may be longer than any practical test.
- Embedded memory-usage programming attempts to minimize these stochastic sources of memory failure.

Eliminate fragmentation

Never call free()! How?
- Drastically minimize the number of allocated buffers. (One per transaction?)
- Declare temporary buffers inside other buffers or on the stack.
- Be willing to waste memory by allocating maximum-sized struct elements.
- Declare a small number of different buffer sizes (<10, even better <5), and allocate those sizes from pools of fixed-size buffers.
- Allocate fewer, bigger (more wasteful) buffers.

Think in terms of transactions

- IoT leaf node processing often consists of a series of largely independent transactions.
- Allocate one buffer to handle an entire transaction. If the transaction buffer can be allocated, you are assured that the transaction can complete. If you can't allocate that one buffer, you can make an explicit decision about how to respond, as in the sketch below.
- Different types of transactions may require different buffer sizes. If the difference is less than a factor of 2, waste space by always allocating the larger size. If the difference is more than a factor of 2, allocate a small number of different buffer sizes.
- A transaction may leave traces, which may require their own buffers, possibly of their own size.
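A sketch of that decision point, reusing the hypothetical pool helpers above (assume the pool was initialized with transaction-sized buffers); reject_request() and handle_transaction() stand in for application code:

    struct transaction { unsigned char payload[1264]; };

    void reject_request(void);                       /* hypothetical */
    void handle_transaction(struct transaction *t);  /* hypothetical */

    void on_request(void) {
        struct transaction *txn = pool_get();
        if (txn == NULL) {
            reject_request();     /* explicit, planned response: shed load, report, etc. */
            return;
        }
        handle_transaction(txn);  /* memory for the entire transaction is now assured */
        pool_put(txn);            /* back to the pool, never free() */
    }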

Barriers to transactional thinking

- Traditional software design encourages functional encapsulation. A transaction may consist of multiple protocol levels (think socket, UDP, IP, link), and encapsulated design suggests a structure (or more) per level. (So far, so good.)
- Traditional design then suggests allocating a buffer for each structure and linking them together. (This is where we get into trouble.)
- Embedded design instead suggests allocating a single composite structure, as sketched below:
  - A single buffer allocation covers the composite structure for the whole transaction.
  - Pointer handling is largely eliminated (and it's a major source of programming errors).
  - Some parts of the composite structure may not be needed. (Waste!)
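A fleshed-out version of the transaction buffer sketched earlier; the layer structs and fields are illustrative placeholders, not a real protocol stack:

    struct link_hdr { unsigned char mac[6]; };
    struct ip_hdr   { unsigned char src[4], dst[4]; };
    struct udp_hdr  { unsigned short sport, dport; };
    struct app_msg  { unsigned char body[512]; };

    /* One allocation covers every protocol level; the layers are embedded
       fields rather than separately allocated, pointer-linked buffers. */
    struct transaction {
        struct link_hdr link;
        struct ip_hdr   ip;
        struct udp_hdr  udp;
        struct app_msg  msg;   /* may go unused on some paths: accepted waste */
    };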

Myth: wasting scarce memory must be avoided

Embedded programming declares that predictability is more important than waste. Allocating the largest needed memory means that:
- The typical case will consume more memory than required. (Waste!)
- The largest case can be handled.
- The logic, tests, and errors associated with different case sizes are eliminated.
Waste is your friend. Consume away!

Waste example: string allocation

    /* Left: */
    struct foo {
        /* ... */
        char *name;
        /* ... */
    };

    /* Right: */
    struct foo {
        /* ... */
        char name[max_name_size];
        /* ... */
    };

- The left side requires a separate malloc for the string; the right side doesn't. Predictability suggests using the right side all the time.
- Strings always have a maximum length. If you don't know what it is, you haven't done your homework.
- Secret: for smaller buffers, the heap overhead for an allocation is often larger than the amount of memory you think you are saving.

Waste example: temporary buffers

    /* Left: */
    struct foo {
        /* ... */
        struct bar *temp1;
        /* ... */
    };

    /* Right: */
    struct foo {
        /* ... */
        struct bar temp1;
        /* ... */
    };

- For most of a transaction, temp1 will be unused; if struct bar is large, the right side wastes much more than the left.
- In general, though, the left side doesn't reduce the maximum heap usage required during the transaction, and maximum usage is what determines heap failure.
- In all cases, after allocating foo, the right side ensures that the transaction will complete.
- Temporaries can also be declared (and allocated) on the stack.

Transaction traces

- Transaction traces are side effects of a transaction that must be retained after the transaction completes and its transaction buffer is released.
- Trace example: a transaction may start persistent ongoing processing (observation/reporting?) that continues until stopped.
- Trace buffers need to be allocated independently of transaction buffers, and their size may be quite different from that of a transaction buffer.
- A trace buffer typically holds a single (composite?) structure containing all the context needed for the purpose of the trace.
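An illustrative trace structure (names and fields are hypothetical), sized and pooled independently of transaction buffers:

    /* All context the ongoing observation needs, gathered in one struct
       so a trace costs exactly one buffer from its own small pool. */
    struct observation_trace {
        unsigned int  resource_id;     /* what is being observed */
        unsigned char peer_addr[16];   /* where reports are sent */
        unsigned int  interval_ms;     /* reporting period */
    };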

Buffer pools

- Once the total number of allocation sizes has been reduced to a small number, you can create a memory pool for each size.
- Example: 3 pools of 24, 65, and 1264 bytes. The 1264-byte buffer is used for each transaction, and two trace types are managed (24 bytes and 65 bytes).
- Guess how many of each size will be needed, and allocate those at startup.
- Each buffer type is allocated from its respective memory pool only; when a buffer is no longer needed, it is returned to the pool whence it came.
- Run the system and analyze how many of each pool type are actually needed, then study available memory and increase pool sizes to provide margin.
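A sketch of that layout under the stated sizes, with a high-water mark per pool to support the tuning step; capacities and names are assumptions:

    #include <stddef.h>

    #define POOL_CAP 16

    struct pool {
        void  *free_list[POOL_CAP];
        int    top, in_use, high_water;
        size_t buf_size;
    };

    /* The three pools from the example: traces (24 and 65 bytes) and
       transactions (1264 bytes), filled from malloc() once at startup. */
    static struct pool pools[3] = {
        { .buf_size = 24 }, { .buf_size = 65 }, { .buf_size = 1264 },
    };

    void *pool_alloc(struct pool *p) {
        if (p->top == 0)
            return NULL;                  /* pool exhausted: explicit decision point */
        if (++p->in_use > p->high_water)
            p->high_water = p->in_use;    /* record the peak for later retuning */
        return p->free_list[--p->top];
    }

    void pool_release(struct pool *p, void *buf) {
        p->in_use--;
        p->free_list[p->top++] = buf;     /* back to its own pool, never free() */
    }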

Benefits of memory pools

Beyond eliminating memory fragmentation:
- You can characterize and guarantee the behavior of your program: "It will handle three transactions at a time" (3 transaction buffers); "It will handle 5 observation requests" (5 observation traces).
- You can monitor the program and easily adjust parameters: "It often runs low on observation trace buffers, so I will add a few"; "It informed me via the network that it is running short on observation trace buffers."
- You can implement fall-back strategies: "If it runs out of observation trace buffers, cannibalize a transaction buffer to add trace buffers and notify me that retuning is needed."

Exception handling

- IoT nodes need to report exceptions, most notably out-of-memory exceptions.
- The exception reporting path must be free of buffer allocations that can fail. This can be accomplished by pre-allocating the buffers needed for reporting; static allocation can also be used for the report buffers, as sketched below. Another case of predictability being more important than waste.
- IoT nodes can use tuning information, such as buffer pool usage, to report data that might predict future problems.
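A sketch of a statically allocated report path; send_report() is a hypothetical transport function, and the fixed buffer is deliberate waste bought for predictability:

    #include <stdio.h>

    static char report_buf[128];   /* reserved at build time; nothing to allocate, nothing to fail */

    void send_report(const char *msg);   /* hypothetical network/report call */

    void report_out_of_memory(const char *pool_name) {
        /* Format into the static buffer: the reporting path touches no heap. */
        snprintf(report_buf, sizeof report_buf, "OOM in pool %s", pool_name);
        send_report(report_buf);
    }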

Embedded memory summary

- IoT leaf node software (and some gateway software) will continue to be highly resource-constrained due to requirements for minimizing cost, power, and maintenance.
- Embedded software requires explicit and careful memory handling to run reliably for long periods of time with limited memory resources.
- Techniques for careful embedded memory handling are well known in the embedded application community, but they need wider visibility in the growing IoT community.

Coroutines

- Coroutines are a method for decomposing complex designs into manageable processes. Think of them as threads for when you don't have threads.
- "Subroutines are special cases of ... coroutines." (Donald Knuth, Fundamental Algorithms)
- Coroutines have been largely forgotten with the advent of object-oriented programming and the availability of thread libraries.
- Because highly constrained computing environments often lack threads, embedded programming uses coroutines in their place to provide much the same benefit.

Threads

Threading is used in conventional programming for two primary purposes:
- Decomposing complex systems.
- Providing concurrent processing on machines with CPU threads.
Constrained computers often have a single thread or a limited number, and when two threads are available, one is often dedicated to a particular purpose such as the OS or sensor response. Threads remain a valuable tool for decomposing designs (that is, making software maintainable); coroutines fill that gap when hardware threads are not available.

Process decomposition

[Diagram: Input -> Process -> Output]

There's usually some sort of processing in the Input and Output sections (checking peripherals, qualifying and transforming input, formatting output). Threading allows writing these processes independently and connecting them.

Process implementation with threads

[Diagram: Input -> queue -> Process -> queue -> Output]

The queues allow separation of function and activity. In an environment with threads, you might apply a thread to each process block. Without threads, this might be written as a single logical process, so the implementation would be completely different.

Process implementation with coroutines

[Diagram: Input -> queue -> Process -> queue -> Output]

Notice that the implementations with threads and coroutines look similar. That's the point. By thinking with coroutines, you can make the same code work with and without a threading library and hardware threads.

Threads and coroutines coexist

[Diagram: the same Input -> queue -> Process -> queue -> Output pipeline shown twice: once implemented with threads, and once with coroutines, with the flow labeled in -> q1 -> q2 -> out (see the code on the next slide).]

Example coroutine code for the previous slide

    /* Drain the input side into q1 while both sides permit. */
    void input(void) {
        while (inhasdata() && q1hasroom()) {
            d = infetch();
            filterdata(&d, &f);
            /* ... */
            q1send(&f);
        }
    }

    /* Top-level loop: give input and output their turns, then process. */
    void process(void) {
        while (1) {
            input();
            output();
            if (q1hasdata() && q2hasroom()) {
                d = q1fetch();
                processdata(&d, &r);
                q2send(&r);
            }
        }
    }

    /* Drain q2 to the output side while both sides permit. */
    void output(void) {
        while (q2hasdata() && outhasroom()) {
            d = q2fetch();
            formatdata(&d, &f);
            /* ... */
            outsend(&f);
        }
    }

Comments on the code on the previous slide

- The code is one example of the myriad ways coroutines can be written. No coroutine library is needed.
- To switch between threads and coroutines, only a small amount of conditional code would be needed (see the sketch below).
- The operation of the coroutines can be modified by changing the conditionals in the example.
- By using coroutines, we can decompose processes without having threads.
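One way that conditional code might look; USE_THREADS is an assumed build flag, and thread_spawn() and the *_loop wrappers are stand-ins for whatever the RTOS or thread library actually provides:

    void input_loop(void);                /* hypothetical: runs input() forever  */
    void process_loop(void);              /* hypothetical: runs process step     */
    void output_loop(void);               /* hypothetical: runs output() forever */
    void thread_spawn(void (*fn)(void));  /* stand-in for the RTOS thread call   */

    #ifdef USE_THREADS
    void start_pipeline(void) {
        /* With a threading library, each block loops in its own thread
           and the queues decouple them. */
        thread_spawn(input_loop);
        thread_spawn(process_loop);
        thread_spawn(output_loop);
    }
    #else
    void start_pipeline(void) {
        /* Without threads, process()'s while(1) loop already drives
           input() and output() cooperatively, so just call it. */
        process();
    }
    #endif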

Summary

- The characteristics of many IoT systems mean that embedded programming skills and techniques will be needed in a wide arena.
- Two aspects of embedded programming were discussed: raising memory handling to a first-order design criterion, and using coroutines to allow functional decomposition in the absence of threads.
- I intend this presentation to be part of a larger conversation between these fields.

References

I have written two white papers on this subject.

This one describes the memory issues in some detail:
https://01.org/blogs/2016/heap-allocation-iot

This one is a case study on using these techniques on real software:
https://01.org/blogs/2016/iot-memory-management-case-study

They should be active by conference start.