Performance Benchmarking - Software Systems Implementation - Handout | ENEE 642, Study notes of Electrical and Electronics Engineering

Material Type: Notes; Professor: Stewart; Subject: Electrical & Computer Engineering; University: University of Maryland; Term: Spring 2000;

Uploaded on 07/30/2009

Department of Electrical and Computer Engineering
University of Maryland at College Park
ENEE 642: Software Systems Implementation
Spring 2000, Prof. D. Stewart
Handout 5: Performance Benchmarking
Before performing any optimization of execution time after implementation (i.e. fine-grain optimization), it is essential to measure the code that is to be optimized. Otherwise, what seems like an optimization might in fact slow down the overall execution time of the code. Thus, measure execution time first, then optimize part of the code, then measure execution time again. Compare the results, and determine whether or not you really were successful.

Never optimize execution time at the cost of software maintenance (e.g. readability, reusability, etc.) unless it is essential for meeting the hard or soft timing requirements of the application.

In this handout, methods for measuring execution time are first described. Then guidelines and hints on how to identify what code should be optimized to meet the timing specifications are discussed.

1. MEASURING EXECUTION TIME
Before doing any fine-grain optimization, you must measure the execution time of your code. Quite often, fine-grain optimization is architecture specific, and what speeds up code with one compiler/processor pair may actually slow down code with another. In this section, we go over various techniques for measuring how long your code takes.

Once you know how long the code takes, you can then pinpoint those sections that take the longest amount of time, and start with those. If you have two routines, one taking 90% of the CPU time and the other 10%, there is no sense in optimizing the one that takes only 10%, since there is more potential savings in the other one for the same optimizing effort.

How can I make my code faster? The first important step is to see how long it takes.

Some terminology:
Resolution: Limitations of your timer or timing method. How long is one tick?
Accuracy: If you get a timing of x +/- y, then y is your Accuracy.
Granularity: What part of the code you can measure.
Difficulty: How easy or difficult is it to use this method. More difficult means more time and money!
Many different methods exist to measure execution time. Generally, a method that improves one or two of these dimensions costs in the other dimensions. Following is a list of some methods, in increasing order of difficulty of using the tools. Notice that as difficulty increases, the other parameters decrease, i.e. the measurements become finer.

Method               Difficulty  Resolution   Accuracy     Granularity
Stop-watch           Easy        0.01 sec     1 sec        program
date/time            Easy        0.02 sec     0.2 sec      program
prof and gprof       Moderate    10 msec      20 msec      subroutines
clock()              Moderate    15-30 msec   15-30 msec   statement
counter/timer chips  Hard        µsec         µsec         statement
logic analyzers      Hard        nsec         µsec         statement
oscilloscope         Very Hard   nsec         µsec         statement

Each method is described in more detail below.

It is important to remember that the execution time of the code is dependent on the processor on which you are executing, and it could be affected by other events in the system, such as preemption by other processes, interrupts, or input/output (I/O). The more accuracy that is needed in your measurements, the more important these considerations become.

1.1 Stop-watch:

Only suitable for non-interactive programs, preferably running on single-tasking systems.

Use to time things like numerical code which may take minutes or hours to execute, and when measurements only need to be approximations (e.g. to nearest second or minute).

1.2 Date command (UNIX)

The date command is used like a stopwatch, except you use the built-in clock of the computer. This method is always easier to use (and more accurate) than a stop-watch, and so use this command if it is available.

A typical way is to wrap your program in a shell script or alias with the following commands:

    date > output
    program >> output
    date >> output

As with the stop-watch method, this will only give you an estimate of how long the full program took to execute. It does not take into consideration preemption, interrupts, or I/O. Most accurate answers are obtained on non-preemptive systems.

This method is useful if the output serves as a log, so that the start and end time of each execution is logged into the file. An example use is for long computer architecture simulations that run in the background and overnight, and you are interested in knowing precisely when it ended.

1.3 Time command (UNIX)

Activate it by prefixing time to your command line. This command not only measures the time between beginning and end of the program, but also computes the execution time used by the specific program, taking into consideration preemption.

The output depends on which version of time you are using. For example:

Version 1: built-in shell command.

    % time program
    8.400u 0.040s 0:18.40 66.1%

What this means: u = user CPU time, hence the execution time of the program is 8.4 seconds. s = system time, which is the execution time used by the operating system while running your program (such as interrupt handlers). The third item is the total real time that the program used (18.4 seconds). This time is approximately the same as the one obtained using the date command. The fourth item is the average percentage of CPU time used, which is usually fairly close to the first number divided by the third number. Why it isn’t exact I’m not sure.

Version 2: /usr/bin/time

    % /usr/bin/time program
    18.4 real 8.4 user 0.4 sys

The same code is measured the same way, only the output is formatted differently.

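    #include <stdio.h>
    #include <time.h>

    clock_t start, finish;
    double total;

    start = clock();
    /* do stuff: the code to be measured goes here */
    finish = clock();

    /* convert the difference from clock() units to seconds */
    total = (double) (finish - start) / (double) CLK_TCK;
    printf("Total = %f\n", total);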

An issue: clock() is defined differently on every system! It can give you a value in μsec, sec, or clock ticks. Check the definition for your OS:

    if in μsec:         #define CLK_TCK 1000000
    if in sec:          #define CLK_TCK 1
    if in clock ticks:  #include <time.h>   (this defines CLK_TCK)
                    or  #define CLK_TCK sysconf(3)

On ANSI C systems, <time.h> also defines CLOCKS_PER_SEC, the number of clock() units per second, which serves the same purpose.

Type clock_t is defined, and it could be long, double, or anything else, for that matter.

Also, depending on the system, clock() might behave differently if the system is preemptive. In some cases, if the task is preempted, the value returned by clock() will include the time spent by the other task too. In other cases, it will only include time used by its own process.

Note that the Open Software Foundation (OSF) claims that getrusage() replaces clock(). Honestly, I don’t know how, and clock() seems much easier to use. So as long as it is on your system, you can use it.
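For reference, here is a minimal sketch of how getrusage() can be used in place of clock(), assuming a POSIX system that provides <sys/resource.h>. The helper name cpu_seconds() is made up for this illustration and is not part of any library.

```c
#include <sys/resource.h>   /* getrusage(), struct rusage (POSIX, not ANSI C) */
#include <sys/time.h>       /* struct timeval */

/* Hypothetical helper: user + system CPU time used so far by this
 * process, in seconds. Returns -1.0 if getrusage() fails. */
double cpu_seconds(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;

    return (double) ru.ru_utime.tv_sec + (double) ru.ru_utime.tv_usec / 1e6
         + (double) ru.ru_stime.tv_sec + (double) ru.ru_stime.tv_usec / 1e6;
}
```

Call cpu_seconds() before and after the code to be measured and subtract the two values, exactly as with clock(). Unlike clock(), the user and system portions are available separately if you need them.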

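Note the granularity! Even though clock() is in μsec, the granularity is 1/sysconf(3) seconds. On UltraSparcs, this is usually 10 msec. Calling the function sysconf() with the argument ‘3’ returns the number of system clock ticks per second. The value 3 is what the constant _SC_CLK_TCK is defined as on that system, so portable code should use the symbolic name from <unistd.h> instead of the number.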

If more resolution than 10 msec is needed, then one of two approaches can be used:

  • Create a loop around what needs to be measured, that executes 10, 100, or 1000 times or more. Measure execution time to the nearest 10 msec. Then divide that time by the number of times the loop executed. If the loop executed 1000 times using a 10 msec clock, you obtain a resolution of 10 μsec for the loop.
  • Use a hardware method, as described below.

The advantage of the loop method is that it does not require any special hardware. The disadvantage is that it forces a change in the code; the change might affect the functionality, and could even cause the program to crash.
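The loop method can be sketched as follows. This is only an illustration: time_per_call() and sample_work() are hypothetical names, and the ANSI C constant CLOCKS_PER_SEC is used here in place of CLK_TCK. Note that the measured time includes the overhead of the loop itself.

```c
#include <time.h>   /* clock(), clock_t, CLOCKS_PER_SEC */

/* Hypothetical helper: average CPU time, in seconds, of one call to fn(),
 * measured by running it `reps` times between two clock() readings. */
double time_per_call(void (*fn)(void), long reps)
{
    clock_t start, finish;
    long i;

    start = clock();
    for (i = 0; i < reps; ++i)
        fn();
    finish = clock();

    return ((double) (finish - start) / (double) CLOCKS_PER_SEC)
           / (double) reps;
}

/* Example workload (an assumption; it stands in for the code under test). */
static volatile double sink;

void sample_work(void)
{
    int k;
    double s = 0.0;

    for (k = 1; k <= 100; ++k)
        s += 1.0 / (double) k;
    sink = s;   /* volatile store keeps the compiler from deleting the loop */
}
```

For example, time_per_call(sample_work, 100000) runs the workload 100000 times, so a 10 msec clock yields a resolution of 0.1 μsec per call.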

The hardware method requires special tools. But if those tools are available, then they can be used to obtain measurements with microsecond resolution with minimal instrumentation of the code.

1.6 Timer/Counter Chip

Some computers have timer/counter chips that are user programmable. In such a case, you can program them to have μsec resolution, and use a method similar to the clock() method above to start and stop the timer/counter.

1.7 Logic Analyzers

A logic analyzer is one of the most commonly used methods for accurately measuring execution time, especially for embedded systems where accurate timing is often essential. For very fine-grained timing, the code is instrumented so that a signal is sent to an output line (e.g. a parallel I/O port or shared bus) to which you can hook up a logic analyzer.

Alternately, if instrumenting the code is not an option, but the address and data lines of the processor are accessible, then these data lines can be monitored for specific addresses. Advanced triggering methods on the analyzer need to be used in this case, and you must disassemble the code to identify the exact addresses to use to begin and end the sequence.

The analyzer is set up to capture the data being sent to the I/O port or over the address lines. Use one event to begin timing, a second event to end it, then measure the length of the signal. Or just capture all the data in state mode and use the time stamps. Logic analyzers with deep buffers can be used to generate an entire timing history of multiple parts of the code, all with one execution of the program.

Resolution of most logic analyzers is typically in the 10 to 200 nsec range. Accuracy, however, is usually a little less (e.g. 1 μsec) due to the overhead of instrumenting code.

1.8 Oscilloscope

This method is similar to using logic analyzers, except that with only one or two channels, only a very small portion of code can be measured. The advantage of oscilloscopes over logic analyzers for debugging usually surfaces when debugging possible hardware interface issues, or if the only output port available for debugging is analog.

For measuring execution time, use an oscilloscope if a logic analyzer is not available, and you know the program is functioning correctly (i.e. you are not trying to debug the hardware).

2. EXERCISES

The following exercises are optional. Use these to practice measuring execution time.

2.1 EXERCISE 1

Take the dbl program from the exercise of the previous handout, and use both prof and gprof to profile execution time of it. Look at the results. How long does it take to execute dblNew() and dblInsert()? Did you encounter problems?

2.2 EXERCISE 2

If you did the above exercise right, you probably cannot tell, because your results are less than the resolution. Try something different: create an array of 10000 or so random numbers. Then write the following simple sort routine to sort the numbers. How long does it take?

Normally use -O2. You can use -O3, but it is possible that your code breaks with some of the more advanced optimizations.

Most compilers also have lots of other options. Some typical options include unrolling loops, inlining subroutines, etc. When using -O, each level is a set that builds upon the optimizations of the previous levels. But what if an optimization performed at -O4 breaks the code, while an optimization added at level -O6 would improve it? Most compilers allow you to select which optimizations to perform. You can select those that improve code, and leave out those that do not.

For example, if you use GNU gcc, type “man gcc”, and look for the section on “Optimization Options”. You will be surprised how many options exist!

In most Windows-based development environments, go to the "Options" window, select compiler options, and you can view different optimizing options, either selecting or deselecting each one.

5. LEARNING YOUR PROCESSOR

Many times, you can improve your ability to quickly optimize code by knowing how long individual instructions take on a specific architecture.

For example,

How long is a 32-bit addition on a 10MHz Motorola 6805 processor?

Do you have any clue? Probably not. How can you be expected to give estimates of how much CPU time your code will take?

To optimize code, and in fact to minimize the need for optimizing code if you are implementing a design, you should be fully aware of the impact on CPU time and memory for EVERY LINE OF CODE YOU WRITE!!!

As an example, any time you work on a new processor, take a few hours to benchmark all of the C instructions that you usually use. The "machine cycle time" means nothing, as that only tells you how long each assembly language instruction takes. A single C instruction can easily take dozens, or even a hundred, assembly instructions!!

An example of such benchmarking that we did of a 9 MHz Z180 embedded processor will be posted on the class website.

5.1 EXERCISE 4

Look at the results of the Z180 benchmarking VERY CLOSELY, and pick out all of the anomalies. An "anomaly" is when a certain operation takes much more, or much less, time than you expected, once you know how long basic 8-bit and 16-bit additions take.

6. GUIDELINES AND HINTS FOR FINE-GRAIN OPTIMIZATION

Ultimately, trial and error using techniques described above will yield what fine-grain optimizations work on your compiler/architecture.

This section is designed to give you a feel for some of the kinds of optimizations that you are most likely to encounter, either because you perform them or someone else did. A good optimization improves execution time without affecting readability. A bad optimization does not improve execution time (or makes it worse). Some optimizations are trade-offs, where you must sacrifice readability for performance. However, never sacrifice reusability or configurability for performance if those aspects are essential requirements.

6.1 Start with functions that use the most time

Using a method such as prof or gprof, you can quickly identify the functions that use the most execution time. The more execution time a function uses, the more likely it will be possible to optimize it. Once you have identified these functions, look for the constructs described in the rest of this section, then follow the hints for optimizing (or staying away from) those constructs.

6.2 Loops

Saving 1-μsec in sequential code saves 1 μsec

Saving 1-μsec in a loop saves N μsec, where N=number of times loop is performed.

Therefore, it is worthwhile to add code outside the loop, in order to save inside a loop. Here is a simple example:

slower:

    for (i = 0; i < N; ++i) {
        if (method == 1)
            do_this();
        else
            do_that();
    }

faster:

    if (method == 1) {
        for (i = 0; i < N; ++i) {
            do_this();
        }
    } else {
        for (i = 0; i < N; ++i) {
            do_that();
        }
    }

The second method improves execution time, but at a small cost of increased code size.
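A compilable version of the two forms is sketched below, so that both can be timed on your own compiler. The function names and the two operations (doubling and incrementing) are placeholders chosen for this illustration. Some optimizing compilers perform this transformation on their own, so measure before assuming a gain.

```c
#include <stddef.h>

/* Slower form: the test on `method` is repeated on every iteration. */
void transform_slow(int *a, size_t n, int method)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        if (method == 1)
            a[i] = a[i] * 2;
        else
            a[i] = a[i] + 1;
    }
}

/* Faster form: the test is done once, outside the loops. */
void transform_fast(int *a, size_t n, int method)
{
    size_t i;

    if (method == 1) {
        for (i = 0; i < n; ++i)
            a[i] = a[i] * 2;
    } else {
        for (i = 0; i < n; ++i)
            a[i] = a[i] + 1;
    }
}
```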

6.3 Unrolling loops

Unrolling loops can be used to optimize code. For example:

    for (i = 0; i < N; ++i)
        dothis(A[i]);

vs.

    /* assumes N is a multiple of 4 */
    for (i = 0; i < N; i += 4) {
        dothis(A[i]);
        dothis(A[i+1]);
        dothis(A[i+2]);
        dothis(A[i+3]);
    }
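The unrolled loop above is only correct when N is a multiple of 4. A minimal sketch of one common way to handle the general case is a clean-up loop for the leftover elements. Summing the array stands in for dothis() here, and sum_unrolled() is a name made up for this example.

```c
#include <stddef.h>

/* Sum n elements with the loop body unrolled four times.
 * A clean-up loop handles the 0 to 3 elements left over when n is
 * not a multiple of 4. */
long sum_unrolled(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; ++i)      /* remainder */
        total += a[i];

    return total;
}
```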

Macros are difficult to debug (you cannot step through them with a symbolic debugger), and very difficult to maintain. Only use them for really short functions or if the functionality you need cannot be obtained by subroutines. As the functionality increases, the overhead of the subroutine becomes more negligible.

6.8 Block copying

When copying large amounts of memory, use block copying functions in the library rather than your own. For example, ANSI-C defines the following:

    memcpy():  Fastest, but doesn’t work on overlapping copies.
    memmove(): Slower than memcpy(), but handles overlapping.

Often, a lot of time is spent optimizing these routines, and so it can be extremely difficult to do any better.

These functions usually gain speed by copying “chunks” of memory with built-in unrolling of loops. To obtain the best optimization using these functions, try to copy in multiples of 2^n bytes, with the memory block to be copied on an address boundary of 2^n.
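The difference between the two calls can be illustrated with a short sketch. The wrapper names copy_block() and shift_right() are invented for this example; only memcpy() and memmove() come from the library.

```c
#include <string.h>   /* memcpy(), memmove() */

/* Copy a whole buffer into a separate, non-overlapping buffer. */
void copy_block(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);        /* regions must not overlap */
}

/* Shift the first n bytes of buf right by `shift` positions, in place.
 * Source and destination overlap, so memmove() is required.
 * The caller must ensure buf holds at least n + shift bytes. */
void shift_right(char *buf, size_t n, size_t shift)
{
    memmove(buf + shift, buf, n);
}
```

Using memcpy() in shift_right() would be undefined behavior, since the source and destination regions overlap.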

6.9 Block filling

As with block copying, use the ANSI-C library routine memset().

6.10 Compound conditions

If code includes many compound conditions, such as

    if ( A && B ) then xxx
    if ( A || B ) then xxx

then the code can be improved by taking advantage of the fact that in C, the second condition is only evaluated if necessary. That means, in the above comparisons:

    A && B: B is only evaluated if A is TRUE.
    A || B: B is only evaluated if A is FALSE.

The strategy is thus one of the following:

  • Put shorter condition first.
  • If &&, then put the one that is usually false first.
  • If ||, then put the one that is usually true first.

Note that optimizations at this level only make a difference if they are part of large loops or if the condition is lengthy (e.g. involves calling a subroutine).
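The effect of operand order can be made visible by counting how often the lengthy condition runs. The sketch below is an illustration only: expensive_check() is a hypothetical stand-in for a costly test, and the divisibility test inside it is arbitrary.

```c
/* Call counter so the effect of operand order is visible. */
static int expensive_calls = 0;

/* Stand-in for a lengthy condition (hypothetical). */
int expensive_check(int x)
{
    ++expensive_calls;
    return x % 7 == 0;
}

/* The cheap test goes first: expensive_check() runs only for the
 * values that pass it. */
int count_matches(const int *a, int n, int threshold)
{
    int i, count = 0;

    for (i = 0; i < n; ++i) {
        if (a[i] > threshold && expensive_check(a[i]))
            ++count;
    }
    return count;
}

/* Number of times expensive_check() has been called so far. */
int get_expensive_calls(void)
{
    return expensive_calls;
}
```

With the operands swapped, expensive_check() would be called once for every element instead of only for those above the threshold.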

6.11 Lots of strcmp()

Code that deals with lots of strings often requires many string comparisons. It is not uncommon to see code that appears as follows:

    if (strcmp(command, "set") == 0) {
        /* do command set */
    } else if (strcmp(command, "get") == 0) {
        /* do command get */
    } else if (strcmp(command, "help") == 0) {
        /* do command help */
    } else if (...) {
        /* etc. */
    }

Code of this form is a candidate for huge improvements in execution time by building decision trees. Refer to a Data Structures or Algorithms textbook for details.

As a simple example of improving the above: instead of comparing complete strings linearly, use only the first character of command. This value can be used either as the outer switch statement in a nested switch, or as an index into a decision tree to quickly point only to commands that start with this first letter. Depending on how many commands there are, multiple levels of decision making can be created by looking at the second, third, etc. letters independently. This method has the further advantage that it will accept commands given as abbreviations, as long as the non-ambiguous part of the command is provided. So instead of issuing the command “help”, simply typing “he” (or just “h”) might be sufficient.
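A minimal sketch of the first-letter approach for the three commands used above follows. The enum, is_prefix(), and parse_command() are names invented for this illustration. Because each of the three commands starts with a different letter, one switch level is enough; a larger command set would need further levels.

```c
#include <string.h>

enum command { CMD_UNKNOWN, CMD_SET, CMD_GET, CMD_HELP };

/* Returns nonzero if `input` is a non-empty prefix of `full`. */
static int is_prefix(const char *input, const char *full)
{
    size_t len = strlen(input);

    return len > 0 && len <= strlen(full) && strncmp(input, full, len) == 0;
}

/* Decode a command by switching on its first character, so at most one
 * string comparison is made per lookup. Abbreviations are accepted. */
enum command parse_command(const char *input)
{
    switch (input[0]) {
    case 's': return is_prefix(input, "set")  ? CMD_SET  : CMD_UNKNOWN;
    case 'g': return is_prefix(input, "get")  ? CMD_GET  : CMD_UNKNOWN;
    case 'h': return is_prefix(input, "help") ? CMD_HELP : CMD_UNKNOWN;
    default:  return CMD_UNKNOWN;
    }
}
```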

6.12 Arrays vs. Pointers

The interchangeability between arrays and pointers at the fine-grain level is a possible area of improvement for execution time. Here are some things to consider when using arrays and you need to optimize for speed.

As a general rule, if accesses are sequential, use pointers. If accesses to the array are random, then access arrays using their indices.

Array indices are simply offsets from the base pointer. The computation of an index is faster when using the [] notation, but incrementing a pointer (e.g. ptr++) is faster than adding the offset to it every time. That is why for sequential access, use pointers and increment them. For random access, use the array notation if possible. I.e.:

    A[i]    => *(A + i)
    A[i][j] => *(A + N*i + j)    (N is the number of elements in a row)

It is usually faster to perform A[i] than *(A+i), but saying ptr=A once, and doing ptr++ each time, is faster than saying A[i++] each time.

Note that some compilers might do this kind of optimization for you, in which case use the method that you believe is most readable.
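The two styles of sequential access are sketched below so that they can be timed against each other. The function names are made up for this example, and on a modern optimizing compiler the two may well compile to the same machine code.

```c
#include <stddef.h>

/* Sequential access using an index: a[i] is computed from the base
 * pointer on each iteration. */
long sum_indexed(const int *a, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        total += a[i];
    return total;
}

/* Sequential access using a pointer that is incremented each time. */
long sum_pointer(const int *a, size_t n)
{
    long total = 0;
    const int *p = a;
    const int *end = a + n;

    while (p < end)
        total += *p++;
    return total;
}
```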

6.13 Functions in Conditions

Quite often, a loop or function can be terminated early to save execution time. For example, consider the code segment:

if (sum(A) > 100) { do stuff }

What if array A has 1000 elements, none of them negative, and the sum of the first two elements is 101? The rest of the sum is computed for nothing. In such a case, it might be worthwhile to compute the sum in line and stop as soon as the threshold is exceeded, rather than calling the function. This might be at the cost of readability (or maybe not), but if minimizing execution time is a priority, this type of code is a candidate.
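A sketch of the early-exit version is shown below. The function name sum_exceeds() is invented for this example, and the sketch assumes that no element is negative; with negative elements a later value could bring the total back under the limit, and the whole array would have to be summed.

```c
#include <stddef.h>

/* Returns 1 as soon as the running sum of a[0..n-1] exceeds `limit`,
 * without adding up the remaining elements; returns 0 otherwise.
 * Only valid when no element is negative (an assumption of this sketch). */
int sum_exceeds(const int *a, size_t n, long limit)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        total += a[i];
        if (total > limit)
            return 1;       /* early exit */
    }
    return 0;
}
```

The test in the text then becomes if (sum_exceeds(A, 1000, 100)) { do stuff }, which keeps the early exit inside a function and costs little in readability.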

6.14 Searching

The basic rule is don’t search for nothing.

If you have a key or index, USE IT! This provides O(1) access to the data, and will be fastest. Lookup and Jump tables are built around this concept.

If you are doing a lot of searching, consider sorting the data first. In many applications, searching occurs many more times than changes in the data, so sorting it once can greatly improve the search.
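Sorting once and then searching many times can be sketched with the ANSI C library routines qsort() and bsearch(). The wrapper names prepare_table() and table_contains() are hypothetical.

```c
#include <stdlib.h>   /* qsort(), bsearch() */

/* Comparison function in the form qsort() and bsearch() expect. */
static int compare_ints(const void *pa, const void *pb)
{
    int a = *(const int *) pa;
    int b = *(const int *) pb;

    return (a > b) - (a < b);
}

/* Sort once, up front. */
void prepare_table(int *table, size_t n)
{
    qsort(table, n, sizeof table[0], compare_ints);
}

/* Each lookup is then a binary search: O(log n) instead of O(n).
 * The table must already be sorted by prepare_table(). */
int table_contains(const int *table, size_t n, int key)
{
    return bsearch(&key, table, n, sizeof table[0], compare_ints) != NULL;
}
```

If the data changes, the table has to be sorted again (or the new item inserted in order) before the next lookup, which is why this pays off mainly when searches greatly outnumber changes.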

6.18 Computations

Addition, subtraction, and multiplication all take about the same amount of time on most processors. Division is usually much slower, but not always. Remember that a/b is the same as a*(1/b). Suppose b is a constant (e.g. 100.0); then doing a*0.01 will be faster than a/100.0. With floating point, the two results can differ in the last bit, because 0.01 cannot be represented exactly.

Integers are usually much faster than floating point. 32-bit floating point is much faster than 64-bit doubles. For integers, however, 16-bit is not necessarily faster than 32-bit on a 32-bit processor. Quite often, the fastest integer computations occur when items are the size of the processor’s data bus and ALU.

Many people make the mistake of converting operations such as a*64 to a<<6. On many architectures, shifting by more than 1 location is slower than multiplication! Furthermore, many modern compilers will automatically do this optimization for you; so write the code the way it is most readable.

6.19 Assembly Language

Assembly language is usually not needed with modern optimizing compilers, as the compilers can often do as good a job as all but the best experts.

If using an older compiler, however, or if for some reason your application prevents you from optimizing code (assuming you already tried using volatile at the right places), then it might be necessary. Many compilers for special processors (like DSPs) also do not optimize as well as humans, but they are getting better.

Use assembly language to optimize code as a last resort! Use the compiler optimizations to get nearly the same results.

6.20 Compromise Algorithms

Sometimes it pays to consider multiple designs for an algorithm and, depending on the characteristics of the data when the routine is called, pick the one that is best suited for that case. For example, the fread() function uses buffered I/O to improve the speed of reading data from disk into memory. But what if the data blocks being requested are bigger than the buffer size? Then the buffer actually slows down the process as compared to just doing a read() system call.

These kinds of problems can be rectified by using known data or arguments to select the algorithm. For example, in the function fread() , the number of bytes to read can be computed from the arguments. If this value is small, then read a block into the internal buffer. If it is large, then bypass the internal buffer, and copy data read directly into the user’s memory.
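The same idea, selecting the algorithm from the arguments, can be sketched with a different and simpler case than fread(): a sort routine that uses insertion sort for short arrays and the library qsort() for long ones. All names below are made up, and the cutoff of 16 is an assumption that should be tuned by measurement on your own system.

```c
#include <stdlib.h>   /* qsort() */

#define SMALL_N 16    /* cutoff is an assumption; tune it by measurement */

static int cmp_int(const void *pa, const void *pb)
{
    int a = *(const int *) pa;
    int b = *(const int *) pb;

    return (a > b) - (a < b);
}

/* Simple insertion sort: little overhead, so often quick on short arrays. */
static void insertion_sort(int *a, size_t n)
{
    size_t i, j;

    for (i = 1; i < n; ++i) {
        int key = a[i];

        for (j = i; j > 0 && a[j - 1] > key; --j)
            a[j] = a[j - 1];
        a[j] = key;
    }
}

/* Pick the algorithm from the size argument, known at call time. */
void sort_ints(int *a, size_t n)
{
    if (n <= SMALL_N)
        insertion_sort(a, n);
    else
        qsort(a, n, sizeof a[0], cmp_int);
}
```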

MOST IMPORTANTLY!!!!!! BE CREATIVE!!!!!!!! AND THINK!!!!!!!

Sometimes looking at a function from a different point of view or by taking a step back can significantly improve how long the code takes.

Get used to timing lots of different things. Through trial and error, you’ll gain the experience to see what helps, and what hurts, and you’ll learn to quickly spot code that can be optimized.