Keys to Writing Efficient Embedded Code

by BILL TRUDELL

EMBEDDED SYSTEMS PROGRAMMING, OCTOBER 1997

A key to writing efficient real-time embedded software is to understand clearly your processor’s architecture, the programming language, the compiler’s features, and the object model used by the compiler. With this understanding, you can identify potentially slow code, make the code faster, and thus write more efficient applications.

Nance Paternoster

I recently had an assignment in which I was responsible for identifying ways to write efficient embedded code. What I discovered isn’t rocket science: common mistakes, misunderstandings, or assumptions about the demands made on the compiler, and overestimating the power of the microprocessor, can adversely impact the execution time of an application. Most of my effort focused on implementing code that doesn’t enable floating-point operations, but instead relies on the math libraries supplied by the compiler vendor. Examples are presented primarily in C, but compiled in C++. I will leave the analysis of virtual tables and the like to the C++ experts. I hope a good compiler vendor does a reasonable job of implementing such things. Most of what I learned, though, can be applied to any programming language.

Inefficient code seems to be more closely related to the human condition than to the chosen programming language. Slow code is probably slow because that’s the way it was written, however unintentionally. I do believe that it’s better to first write code that is correct and then to optimize it. There will always be another compiler switch, faster clock chip, or newer processor around the bend. Well-written test drivers can prove equality between two implementations.

My general assumption is that to implement efficient embedded software, a developer must be familiar with the code that the compiler generates, as well as with the microprocessor architecture. Writing efficient code, though, can sometimes make it less portable. Code written in Assembler is usually processor-specific and not portable. The code you write today will very likely need to run on a different processor in a year or two. Using Assembler, then, to make the code fast might not be prudent, except for interrupt service routines or frequently-used functions.

Analysis of any improvements made to working code is very important. Validate changes to ensure that errors have not been introduced. Make sure that the desired level of precision and the accuracy with which calculations are performed is maintained or is adequate. You can easily overlook rounding and truncation errors.
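A test driver of the kind mentioned above can be quite small. The following is a minimal sketch in C, not code from the original assignment: scale_reference, scale_optimized, and count_mismatches are hypothetical names, and the two scaling routines merely stand in for a known-good implementation and its faster replacement. A tolerance is used because rounding may differ between the two versions.

```c
#include <stddef.h>

/* Hypothetical reference implementation: straightforward, assumed correct. */
static float scale_reference(float x)
{
    return x / 10.0f;
}

/* Hypothetical optimized version: multiply by a precomputed reciprocal. */
static float scale_optimized(float x)
{
    return x * 0.1f;
}

/* Counts the inputs for which the two versions differ by more than tol. */
size_t count_mismatches(const float *inputs, size_t n, float tol)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        float diff = scale_reference(inputs[i]) - scale_optimized(inputs[i]);
        if (diff < 0.0f) {
            diff = -diff;               /* absolute difference */
        }
        if (diff > tol) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Run over a table of representative and boundary inputs, a nonzero count flags exactly the rounding and truncation differences warned about above.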

DATA TYPE SPECIFICATIONS

One common oversight is specifying the wrong data type and then allowing the compiler to convert the type automatically. Automatic type conversion is generally taken for granted, but it does chew up valuable processor time. Unless you look at the generated Assembler, you may not notice that code which compiles, links, runs, and produces the right output is nevertheless very inefficient.

Table 1 contrasts two code segments generated for a processor without an FPU. The segment on the left omitted the use of the single-precision floating-point specifier “f”, an easily overlooked mistake. The value 10. defaulted to double precision and forced the numerator to first be converted to double precision before the double-precision divide. The result of the division, a double-precision value, is then converted back to single precision. Without an FPU, these conversion operations, as implemented in a software math library, are very expensive compared to integer operations.

Correctly specifying the divisor as a single-precision value produces the code shown in the right-hand segment in Table 1. A single-precision division is used and the type conversions are avoided. If the numerator were an integer, a conversion from long to float would be required. This would be a time-consuming operation that could be avoided if the data type were specified as float. (See section A.6 of The C Programming Language by Kernighan and Ritchie [1].)

Single-precision accuracy is probably used more frequently than double precision. Therefore, if double precision is required, a simple comment in the code would remove all doubt as to the developer’s intention and design.
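The effect of the missing “f” can be checked without reading any Assembler. The sketch below is an illustration only; the function names are hypothetical. It uses sizeof on the two divide expressions to show which type the compiler selected for each.

```c
#include <stddef.h>

/* Without the suffix, 10.0 is a double, so x is promoted and the
   divide is carried out in double precision. */
float tenth_double_literal(float x)
{
    return x / 10.0;    /* float -> double, double divide, double -> float */
}

/* With the suffix, the whole expression stays in single precision. */
float tenth_float_literal(float x)
{
    return x / 10.0f;   /* single-precision divide, no conversions */
}

/* sizeof does not evaluate its operand; it only reports the size of the
   type the compiler chose for the expression. */
size_t size_of_unsuffixed_divide(void)
{
    float x = 1.0f;
    return sizeof(x / 10.0);
}

size_t size_of_suffixed_divide(void)
{
    float x = 1.0f;
    return sizeof(x / 10.0f);
}
```

On a target with software floating point, the first function would be expected to pull in the conversion and double-precision divide routines, while the second needs only the single-precision divide.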

AUTOMATIC PROMOTIONS

Implied promotions can easily be taken for granted. In some cases you’ll find it desirable or necessary to write and debug the code first in a PC environment, and later port or recompile it for the embedded processor. The clock speed on the PC will usually be much faster than the embedded hardware, and the PC will surely


Compilers generally have their own optimizations for switch and case statements, which use jump tables and take note of large gaps in the values used in the various cases. Table 5 shows some different ways to optimize algebraic equations.

Look for the repeated use or evaluation of the same expression. In Equation 1, the product B * C is evaluated twice. Defining another variable to hold the product increases code size but avoids the extra multiply. Depending on the processor and data type, this can result in significant time savings. Note the following example:

    D = A / (B * C)
    E = 1 / (1 + (B * C))

Evaluate B * C once:

    bc = B * C                    (1)
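Written out in C, Equation 1 might look like the sketch below. The function names are hypothetical, and an optimizing compiler may already perform this substitution itself, so the generated code is worth checking before rewriting by hand.

```c
/* Evaluates the product B * C twice. */
void solve_repeated(float a, float b, float c, float *d, float *e)
{
    *d = a / (b * c);
    *e = 1.0f / (1.0f + (b * c));
}

/* Evaluates B * C once and reuses it, at the cost of one extra variable. */
void solve_shared(float a, float b, float c, float *d, float *e)
{
    float bc = b * c;
    *d = a / bc;
    *e = 1.0f / (1.0f + bc);
}
```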

LITERAL DEFINITIONS

The specification of common values using #defines or const terms might be pragmatic, but is also prone to error. In the following example, several significant observations can be made. The value 3.14 is double precision, forcing a double-precision multiply and later a double- to single-precision type conversion call, all of which is time-consuming.

    #define TWO_PI 2 * 3.14

    ; float c, r;
    ; c = 2 * 3.14 * r;
        move.l  -8(a6),-(sp)          // load r
        jsr     __ftod                // Make dbl
        move.l  #1374389535,-(sp)     // Load
        move.l  #1075388088,-(sp)     // 2 * PI
        jsr     __dmul                // Multiply 2PI * r
        jsr     __dtof                // Convert to single
        move.l  (sp)+,-4(a6)          // Save in c

The next example shows that the defined value, along with the hierarchy of operators, results in an incorrect solution for a circle’s radius because the circumference variable c is first divided by two rather than by the product (2 * PI). The code also shows that the multiplication of 2 * PI occurs at run time everywhere the literal TWO_PI is used:

    #define TWO_PI 2 * 3.14f

    ; float c, r;
    ; r = c / 2 * 3.14f;
        move.l  -4(a6),-(sp)          // Load c
        move.l  #1073741824,-(sp)     // Load 2.0f
        jsr     __fdiv                // c / 2
        move.l  #1078523331,-(sp)     // Load 3.14f
        jsr     __fmul                // (c / 2) * 3.14
        move.l  (sp)+,-8(a6)          // Save in r

Two correct implementations follow, one using a #define, and the other, a const variable. The substitution for the literal definition avoids a multiply because the compiler evaluates the constant expression inside the parentheses at compile time. Similarly, the const variable is also evaluated at compile time, but requires more memory and some overhead for referencing the variable:

    #define TWO_PI (float)(2 * 3.14)
    // const float TWO_PI = 2 * 3.14f;
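The precedence problem is easy to reproduce. In the sketch below, the macro and function names are hypothetical and chosen only to keep the two variants side by side; the unsafe macro expands textually inside the divide, while the parenthesized one is treated as a single constant.

```c
/* Unparenthesized: c / TWO_PI_UNSAFE expands to c / 2 * 3.14f,
   which the compiler parses as (c / 2) * 3.14f. */
#define TWO_PI_UNSAFE 2 * 3.14f

/* Parenthesized and single precision: folded to one float constant. */
#define TWO_PI_SAFE (2.0f * 3.14f)

/* Radius from circumference, wrong because of operator precedence. */
float radius_unsafe(float c)
{
    return c / TWO_PI_UNSAFE;
}

/* Radius from circumference, divides by the full product. */
float radius_safe(float c)
{
    return c / TWO_PI_SAFE;
}
```

For a circumference of 6.28, the safe version yields a radius of about 1, while the unsafe version yields roughly 9.86, which is 3.14 squared.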

TABLE 3
Rearranging equations for efficient preprocessing.

C Code Example:

    float num = (17.0f / 10.0f) * (float)(sqrt(val));

Assembler:

    move.l  -4(a6),-(sp)          // Load Stack with float val
    jsr     __ftod                // Convert val from Single to Double
    jsr     _sqrt                 // Double Precision Square Root
    addq.l  #4,sp
    move.l  d1,(sp)               // Load Stack with Double Result
    move.l  d0,-(sp)
    jsr     __dtof                // Convert sqrt() Double Result to Single
    move.l  #1071225242,-(sp)     // Load pre-evaluated 17.0f/10.0f
    jsr     __fmul                // Single Precision Multiply
    move.l  (sp)+,-8(a6)          // Pop and Store Result in variable num

TABLE 4
Rewriting algebraic expressions for better efficiency.

C Code Successive Divides:          C Code Multiply instead of Divide:

    D = A / B / C;                      D = A / (B * C);

Assembler:                          Assembler:

    move.l  -4(a6),-(sp)                move.l  -4(a6),-(sp)
    move.l  -8(a6),-(sp)                move.l  -8(a6),-(sp)
    jsr     __fdiv                      move.l  -12(a6),-(sp)
    move.l  -12(a6),-(sp)               jsr     __fmul
    jsr     __fdiv                      jsr     __fdiv
    move.l  (sp)+,-16(a6)               move.l  (sp)+,-16(a6)
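The two forms in Table 4 can be placed side by side in C as in the sketch below; the function names are hypothetical. The forms are algebraically equal, but they round differently in general, and the product b * c can overflow or underflow where the successive divides would not, so the results should be compared over the expected input range before the rewrite is adopted.

```c
/* Two divides: A / B, then the quotient divided by C. */
float successive_divides(float a, float b, float c)
{
    return a / b / c;
}

/* One multiply and one divide: usually cheaper when divide is the
   slower operation on the target. */
float multiply_then_divide(float a, float b, float c)
{
    return a / (b * c);
}
```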

INTEGER MATH VS. FLOATING POINT

If inputs are bounded by definition, convention, or data type, a chance exists that integer math and appropriate scaling might be substituted for floating-point computations, as shown in Table 6.

The savings in this case may seem unimpressive, but replacing a floating-point subtraction with a left shift and integer subtraction represents a significant savings in execution time. Don’t forget that pushing arguments on the stack, jumping to the subroutine, returning, and adjusting the stack are all overhead in solving the problem.
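As a rough illustration of scaled integer math (this is not the content of Table 6), the sketch below removes an offset from a raw unsigned reading. The Q4 scale factor of 16, the function names, and the idea of an offset-corrected reading are all assumptions made for the example.

```c
#include <stdint.h>

#define Q4_SHIFT 4   /* hypothetical scale: one count equals 1/16 of a unit */

/* Floating-point version: on a part without an FPU this costs an
   integer-to-float conversion plus a library subtract. */
float remove_offset_float(uint16_t raw, float offset)
{
    return (float)raw - offset;
}

/* Scaled-integer version: the offset is stored pre-multiplied by 16,
   so the work reduces to a left shift and an integer subtraction.
   The result is also scaled by 16. */
int32_t remove_offset_q4(uint16_t raw, int32_t offset_q4)
{
    return ((int32_t)raw << Q4_SHIFT) - offset_q4;
}
```

With raw equal to 100 and an offset of 2.5 stored as 40, the scaled result is 1560, which is 97.5 times 16. The bound on the input is what makes this safe: a 16-bit reading shifted left by four bits always fits in 32 bits.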

THE STANDARD MATH LIBRARY

As I’ve already mentioned, the standard math library generally expects double-precision values. Massive penalties result when converting from single precision to double precision and back again when using floating-point emulation software. However, using double precision as the default data type isn’t the solution either, because it consumes more space and time (unless you’re using a Pentium Pro, which does everything in double precision but isn’t really an embedded processor).

Table 7 shows how a single-precision absolute value can be written. While the alternative generates more code, it’s much faster than the type conversion function calls. This alternative can be encapsulated in a macro or in-line function. It would be even better if the function abs() were overloaded for all relevant data types.
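Encapsulated as an in-line function, the test-against-zero alternative might look like the following sketch. The name abs_single is hypothetical. C99 later added fabsf to <math.h> for this purpose; whether a particular embedded toolchain supplies an efficient version is something to verify in the generated code.

```c
/* Single-precision absolute value without any conversion to double.
   Note: an input of -0.0f is returned unchanged, because it does not
   compare less than zero. */
static inline float abs_single(float x)
{
    return (x < 0.0f) ? -x : x;
}
```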

COMPILER OPTIONS AND ISSUES

Compilers offer many degrees of optimization. Some of these features are related to the programming language or object model, while others use specific knowledge of the processor to make the code execute faster. If developers are expected to


TABLE 5
Algebraic simplifications and the laws of exponents.

Original Expression:              Optimized Expression:

a^2 - 3a + 2                      (a - 1) * (a - 2)
  1 multiply (square term)          2 subtractions
  1 subtraction                     1 multiplication
  1 multiply (3a)
  1 addition

(a - 1) * (a + 1)                 a^2 - 1
  1 subtraction                     1 multiply (square term)
  1 multiplication                  1 subtraction
  1 addition

1 / (1 + a / b)                   b / (b + a)
  2 divides                         1 divide
  1 addition                        1 addition

a^m * a^n                         a^(m + n)
  2 power functions                 1 addition
  1 multiply                        1 power function

(a^m)^n                           a^(m * n)
  2 power functions                 1 multiply
                                    1 power function

The inlining of run-time libraries can also help reduce execution time by avoiding function call overhead or by supporting optimized operations. For example, a memcpy routine can be optimized for a small number of bytes and result in an inline expansion. If the size is very large or the data type is user-specified instead of a primitive type, a function call to the memcpy routine might be generated. I found in one particular case that using memset resulted in a function call. While the library function handled the various data organization schemes that could be selected by the user, a quicker inline version might be a better choice if the size is known to be small. Even though the internals of the memset function were implemented in Assembler, the overhead for setting a long word to zero by using a memset was excessive.

For portability, among other reasons, you should give special care to the alignment of data. These choices can affect efficiency and can vary with the processor used.

If your processor has an FPU, make sure the compiler has this switch turned on and that you’re not running software to do floating-point calculations.

Default options should be well understood because they may have an impact on performance. For example, Microsoft Visual C++ supports a workaround for Pentium processors with flawed floating-point instructions. The Help Index states:

“By default, the workaround is disabled (/QIfdiv-), and the code generator emits code that is unsafe on a flawed Pentium.

“If the workaround is enabled (/QIfdiv), the code generator emits fatter, safe code that tests for the processor bug and calls run-time routines instead of using the native instructions of the processor to generate correct floating-point results.” [2]

So a trade-off exists between accuracy and speed. Therefore, be very careful when generating benchmarks and comparing the accuracy of generated values with specific versions of programs, like a debug build versus a release build. It’s important to understand the implications of the switches chosen. Obviously, with a flawed Pentium, the run-time routines will run much slower than native instructions, but they’ll be more accurate. A Pentium would not be a typical embedded processor selection, but your processor or compiler may have its own set of quirks.

Some high-end processors have both instruction and data caches. The compiler provides switches for enabling these features. The instruction cache should be enabled; the data cache should only be enabled if sufficient consideration has been given to data synchronization. Multiprocessing will require extra hardware for bus snooping to be sure that the data cache is synchronized between two or more processors or processes.
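Returning to the memset observation earlier in this section, the two ways of clearing a single long word can be compared with a sketch like the one below. The function names are hypothetical. Whether the memset call is expanded inline depends entirely on the compiler and its switches, so the generated Assembler is the only reliable guide.

```c
#include <stdint.h>
#include <string.h>

/* Library route: general purpose, but may compile to a function call
   with its argument passing and return overhead. */
void clear_with_memset(uint32_t *word)
{
    memset(word, 0, sizeof *word);
}

/* Direct route: a plain store, typically a single instruction. */
void clear_with_store(uint32_t *word)
{
    *word = 0u;
}
```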

MEMORY ALLOCATION

TABLE 7
Floating-point absolute value.

C Code:

    output = fabs(input);         // (Slow Code Ahead)

Assembler:

    move.l  -4(a6),-(sp)          // Load input on stack
    jsr     __ftod                // Convert it from Single to Double Precision
    jsr     _fabs                 // Double Precision ABS
    addq.l  #4,sp
    move.l  d1,(sp)               // Load result
    move.l  d0,-(sp)
    jsr     __dtof                // Convert result from Double to Single
    move.l  (sp)+,-8(a6)          // Save result in output

C Code Test against zero instead of abs() function:

    if (input < 0)
        output = -input;
    else
        output = input;

Assembler:

         move.l  -4(a6),-(sp)
         clr.l   -(sp)
         jsr     __fcmp
         bge.s   L38
         move.l  -4(a6),-(sp)
         eori.b  #128,(sp)        // XOR the sign bit
         move.l  (sp)+,-8(a6)     // Save output
         bra.s   L39
    L38: move.l  -4(a6),-8(a6)    // Save output
    L39:

For time-critical sections of code, the use of the memory manager should be scrutinized. If the size of an object or data type is small and the scope is sufficiently restricted, the stack might be a better choice for a storage area. This analysis might only be obvious after a design has been implemented. Some assumptions can then be made regarding the future growth of an object: if the object is small, maybe it’s better left on the stack, rather than incurring the overhead of calling memory allocation functions. Another benefit is that keeping the smaller segments out of the heap can lessen fragmentation. In C and C++, using malloc, free, new, and delete doesn’t come without some time penalty, yet it can be more flexible than using the stack.
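The stack-versus-heap choice described above can be sketched as follows. The scenario (summing a working copy of a small message), the size limit SMALL_MSG_MAX, and the function names are assumptions made for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_MSG_MAX 16u   /* hypothetical upper bound on message size */

/* Heap version: handles any length, but pays for malloc and free on
   every call and can contribute to fragmentation. */
uint32_t checksum_via_heap(const uint8_t *msg, size_t len)
{
    uint32_t sum = 0;
    uint8_t *copy = malloc(len);
    if (copy == NULL) {
        return 0;                   /* allocation failed */
    }
    memcpy(copy, msg, len);
    for (size_t i = 0; i < len; ++i) {
        sum += copy[i];
    }
    free(copy);
    return sum;
}

/* Stack version: no allocator calls, but only valid while the message
   is known to stay within SMALL_MSG_MAX bytes. */
uint32_t checksum_on_stack(const uint8_t *msg, size_t len)
{
    uint8_t copy[SMALL_MSG_MAX];
    uint32_t sum = 0;
    if (len > SMALL_MSG_MAX) {
        return 0;                   /* too large for the stack buffer */
    }
    memcpy(copy, msg, len);
    for (size_t i = 0; i < len; ++i) {
        sum += copy[i];
    }
    return sum;
}
```

The stack version trades flexibility for speed, which is why the assumption about future growth of the object matters.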

C++

Some mistakes I’ve frequently made in C++ code have made my programs run more slowly than I would have liked. I found it especially important to know when a constructor is being executed. Trace statements or reference counting can help monitor these events. Most often, it’s the copy constructor or assignment operator that is easily overlooked. It was only after stepping through the code that I noticed my design and implementation required the use of a copy constructor. The copy constructor was being called in a tight loop, which cost significant execution time. The design had to be redone.

Whenever possible, use a profiler or incorporate some crude timing elements to judge how efficiently the program is executing. This tactic is handy for benchmarking and for making performance changes whose benefit is difficult to prove except by empirical methods. Some of the better development environments have a built-in profiler or an add-on component that monitors entries and returns from methods.

Another common mistake is passing an object back on the stack instead of passing its reference. The copy constructor is executed for the temporary object.

After a design has been implemented, tested, and used, a second pass can be made to improve the application. Classes that were originally necessary may become superfluous and can be eliminated or encapsulated in a superclass. This may remove unnecessary layers of inheritance which add to the overhead and processing time of an application.
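The “crude timing elements” mentioned above need not be elaborate. The sketch below is written in C, like the other examples, since the technique is not specific to C++. It uses the standard clock() function; busy_work and time_repeated are hypothetical names, and on many embedded targets a hardware timer or a toggled output pin would take the place of clock().

```c
#include <time.h>

/* Hypothetical workload standing in for the routine under test. */
static void busy_work(void)
{
    volatile unsigned long sum = 0;   /* volatile keeps the loop from being removed */
    for (unsigned long i = 0; i < 10000UL; ++i) {
        sum += i;
    }
}

/* Runs fn the given number of times and returns the processor time
   consumed, in seconds, as reported by clock(). */
double time_repeated(void (*fn)(void), unsigned reps)
{
    clock_t start = clock();
    for (unsigned i = 0; i < reps; ++i) {
        fn();
    }
    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC;
}
```

Timing the same routine before and after a change, with enough repetitions to swamp the timer resolution, gives the empirical evidence the text calls for.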

MISCELLANEOUS SETBACKS

A common source of performance degradation is the cutting and pasting of working code as a base for newer features. One bad example, copied and promulgated in the system, significantly degrades performance. Bad examples should always be corrected, or at least commented. Developers will appreciate your honesty and humility.

We often write code with the assumption that it might be used in the future. When the future arrives, the requirements may have changed or may be understood such that the old design is insufficient or wrong. Thus, when in doubt, leave it out. Don’t code if it isn’t needed. If you’re using archival tools, the questionable code can be saved there. It’s frustrating to browse through code that is no longer built, looking for code that needs to be optimized, or for common defects in relic code.

RECOGNIZE THE SIGNS

I’ve discussed some common implementation oversights and mistakes and offered some recommendations regarding implementation details that will optimize code for time. A top-down approach should generally be used to find where the time is being spent. Hopefully, you’ll find a smoking gun. However, this approach will address only specific instances of code inefficiency. A parallel path can also be taken, in which the cross-reference section of the link map is inspected for unusual references. Double-precision math operations are an example. Most importantly, optimize something that already works: the first step is to get the correct solution.

Some fundamental concepts must first be understood in order to write efficient real-time embedded software. These concepts include knowing your processor’s architecture, the programming language used, the features of the compiler, and even the object model used by the compiler. Inspecting the code generated by the compiler closes the loop and allows you to see how well the compiler has interpreted your coding requests and converted them into executable code. Be ready for some surprises.

With this understanding and mindset, you will learn to recognize those warning signs that read “slow code ahead.” Then you can speed up your code, and write more efficient applications in the process.

Bill Trudell is a software engineer currently employed by Fisher Rosemount Systems, Inc., a solutions provider for the process control industry. He has extensive experience in the design of real-time multitasking embedded and PC applications in various domains. Bill can be reached via e-mail at [email protected].

REFERENCES

  1. Kernighan, Brian W. and Dennis M. Ritchie. The C Programming Language, Second Edition. Englewood Cliffs, NJ: Prentice-Hall, 1988.
  2. Microsoft Visual C++, Version 4.2, Books On-line. Microsoft Corp., Redmond, WA.
