6-Performance Analysis of
Embedded System Designs:
Digital Camera Case Study (cont.)
Outline
• Introduction to a simple digital camera
• Designer's perspective
• Requirements specification
• Design
– Four implementations
Introduction to a simple digital camera
- Captures images
- Stores images in digital format
- No film
- Multiple images stored in camera
- Number depends on amount of memory and bits used per image
- Downloads images to PC
- Only recently possible
- Systems-on-a-chip
- Multiple processors and memories on one IC
- High-capacity flash memory
- Very simple description used for example
- Many more features with real digital camera
- Variable size images, image deletion, digital stretching, zooming in and out, etc.
Designer's perspective
• Two key tasks
– Processing images and storing in memory
• When shutter pressed:
– Image captured
– Converted to digital form by charge-coupled device (CCD)
– Compressed and archived in internal memory
– Uploading images to PC
• Digital camera attached to PC
• Special software commands camera to transmit archived images serially
CCDPP (CCD PreProcessing) module
- Performs zero-bias adjustment
- CcdppCapture uses CcdCapture and CcdPopPixel to obtain image
- Performs zero-bias adjustment after each row read in
#define SZ_ROW 64
#define SZ_COL 64
static char buffer[SZ_ROW][SZ_COL];
static unsigned rowIndex, colIndex;

/* CcdCapture and CcdPopPixel are provided by the CCD module. */

void CcdppInitialize(void) {
    rowIndex = -1;
    colIndex = -1;
}

void CcdppCapture(void) {
    char bias;
    CcdCapture();
    for(rowIndex=0; rowIndex<SZ_ROW; rowIndex++) {
        for(colIndex=0; colIndex<SZ_COL; colIndex++) {
            buffer[rowIndex][colIndex] = CcdPopPixel();
        }
        /* two further pixels are popped per row and averaged to get the bias */
        bias = (CcdPopPixel() + CcdPopPixel()) / 2;
        for(colIndex=0; colIndex<SZ_COL; colIndex++) {
            buffer[rowIndex][colIndex] -= bias;
        }
    }
    rowIndex = 0;
    colIndex = 0;
}

char CcdppPopPixel(void) {
    char pixel;
    pixel = buffer[rowIndex][colIndex];
    if( ++colIndex == SZ_COL ) {
        colIndex = 0;
        if( ++rowIndex == SZ_ROW ) {
            colIndex = -1;
            rowIndex = -1;
        }
    }
    return pixel;
}
UART module
• Actually a half UART
– Only transmits, does not receive
• UartInitialize is passed name of file to output to
• UartSend transmits (writes to output file) one byte at a time
#include <stdio.h>
static FILE *outputFileHandle;

void UartInitialize(const char *outputFileName) {
    outputFileHandle = fopen(outputFileName, "w");
}

void UartSend(char d) {
    fprintf(outputFileHandle, "%i\n", (int)d);
}
CODEC module
- Models FDCT encoding
- ibuffer holds original 8 x 8 block
- obuffer holds encoded 8 x 8 block
- CodecPushPixel called 64 times to fill ibuffer with original block
- CodecDoFdct called once to transform 8 x 8 block
- Explained in next slide
- CodecPopPixel called 64 times to retrieve encoded block from obuffer
static short ibuffer[8][8], obuffer[8][8], idx;

void CodecInitialize(void) { idx = 0; }

void CodecDoFdct(void) {
    int x, y;
    for(x=0; x<8; x++) {
        for(y=0; y<8; y++)
            obuffer[x][y] = FDCT(x, y, ibuffer);
    }
    idx = 0;
}

void CodecPushPixel(short p) {
    if( idx == 64 ) idx = 0;
    ibuffer[idx / 8][idx % 8] = p;
    idx++;
}

short CodecPopPixel(void) {
    short p;
    if( idx == 64 ) idx = 0;
    p = obuffer[idx / 8][idx % 8];
    idx++;
    return p;
}
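The FDCT routine called by CodecDoFdct is not shown in this part of the slides. As a rough illustration only, the sketch below computes one coefficient with the standard 8 x 8 forward DCT formula, F(u,v) = 1/4 C(u) C(v) ΣΣ D(x,y) cos((2x+1)uπ/16) cos((2y+1)vπ/16), in floating point. The names FdctCoeff, Cos16 and QUARTER_COS are hypothetical, and the lecture's own version may well use a fixed-point cosine table instead.

```c
/* cos(k*pi/16) for k = 0..8 (one quarter of the cosine wave). */
static const double QUARTER_COS[9] = {
    1.0, 0.9807852804, 0.9238795325, 0.8314696123, 0.7071067812,
    0.5555702330, 0.3826834324, 0.1950903220, 0.0
};

/* cos(k*pi/16) for any non-negative k, using the symmetry of cosine. */
static double Cos16(int k) {
    k %= 32;                                  /* period 2*pi is 32 steps */
    if (k > 16) k = 32 - k;                   /* cos(2*pi - t) = cos(t)  */
    if (k > 8) return -QUARTER_COS[16 - k];   /* cos(pi - t) = -cos(t)   */
    return QUARTER_COS[k];
}

/* Hypothetical stand-in for FDCT(u, v, block): returns coefficient (u, v)
 * of the 8 x 8 forward DCT of block, rounded to the nearest integer. */
short FdctCoeff(int u, int v, short block[8][8]) {
    double cu = (u == 0) ? QUARTER_COS[4] : 1.0;  /* C(0) = 1/sqrt(2) */
    double cv = (v == 0) ? QUARTER_COS[4] : 1.0;
    double sum = 0.0, r;
    int x, y;
    for (x = 0; x < 8; x++) {
        for (y = 0; y < 8; y++) {
            sum += block[x][y] * Cos16((2 * x + 1) * u) * Cos16((2 * y + 1) * v);
        }
    }
    r = 0.25 * cu * cv * sum;
    return (short)(r >= 0.0 ? r + 0.5 : r - 0.5);  /* round to nearest */
}
```

For a block in which every pixel has the same value, only the (0, 0) coefficient is non-zero, which is why the transform helps compression on smooth image regions.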
CNTRL (controller) module
- Heart of the system
- CntrlInitialize for consistency with other modules only
- CntrlCaptureImage uses CCDPP module to input image and place in buffer
- CntrlCompressImage breaks the 64 x 64 buffer into 8 x 8 blocks and performs FDCT on each block using the CODEC module
- Also performs quantization on each block
- CntrlSendImage transmits encoded image serially using UART module
#define SZ_ROW 64
#define SZ_COL 64
#define NUM_ROW_BLOCKS (SZ_ROW / 8)
#define NUM_COL_BLOCKS (SZ_COL / 8)
static short buffer[SZ_ROW][SZ_COL], i, j, k, l, temp;

void CntrlInitialize(void) {}

void CntrlCaptureImage(void) {
    CcdppCapture();
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++)
            buffer[i][j] = CcdppPopPixel();
}

void CntrlCompressImage(void) {
    for(i=0; i<NUM_ROW_BLOCKS; i++)
        for(j=0; j<NUM_COL_BLOCKS; j++) {
            for(k=0; k<8; k++)
                for(l=0; l<8; l++)
                    CodecPushPixel((char)buffer[i * 8 + k][j * 8 + l]);
            CodecDoFdct(); /* part 1 - FDCT */
            for(k=0; k<8; k++)
                for(l=0; l<8; l++) {
                    buffer[i * 8 + k][j * 8 + l] = CodecPopPixel();
                    /* part 2 - quantization */
                    buffer[i * 8 + k][j * 8 + l] >>= 6;
                }
        }
}

void CntrlSendImage(void) {
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++) {
            temp = buffer[i][j];
            UartSend(((char*)&temp)[0]); /* send upper byte */
            UartSend(((char*)&temp)[1]); /* send lower byte */
        }
}
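CntrlSendImage picks the two bytes of each value by indexing through a char pointer, so which byte is sent first depends on the byte order of the target. A possible alternative, shown only as a sketch, extracts the bytes with shifts. SplitShort is a hypothetical helper that is not part of the slides, and it assumes a 16-bit short.

```c
/* Hypothetical helper: split a 16-bit value into upper and lower bytes.
 * Shifts give the same result on big-endian and little-endian targets. */
void SplitShort(short value, unsigned char *upper, unsigned char *lower) {
    unsigned short u = (unsigned short)value;  /* avoid shifting a negative value */
    *upper = (unsigned char)(u >> 8);
    *lower = (unsigned char)(u & 0xFF);
}
```

With such a helper, the send loop could pass the upper byte and then the lower byte to UartSend without relying on memory layout.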
Design
- Determine system's architecture
- Processors
- Any combination of single-purpose (custom or standard) or general-purpose processors
- Memories, buses
- Map functionality to that architecture
- Multiple functions on one processor
- One function on one or more processors
- Implementation
- A particular architecture and mapping
- Solution space is set of all implementations
- Starting point
- Low-end general-purpose processor connected to flash memory
- All functionality mapped to software running on processor
- Usually satisfies power, size, and time-to-market constraints
- If timing constraint not satisfied then later implementations could:
- use single-purpose processors for time-critical functions
- rewrite functional specification
Implementation 1: Microcontroller alone
• Low-end processor could be Intel 8051 microcontroller
• Total IC cost including NRE about $
• Well below 200 mW power
• Time-to-market about 3 months
• However, one image per second not possible
- 12 MHz, 12 cycles per instruction
- Executes one million instructions per second
- CcdppCapture has nested loops resulting in 4096 (64 x 64) iterations
- ~100 assembly instructions each iteration
- 409,600 (4096 x 100) instructions per image
- About 40% of the one-second budget for reading image alone
- Would be over budget after adding compute-intensive DCT and Huffman encoding
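The estimate above can be reproduced with a few lines of integer arithmetic. The figures (12 MHz, 12 cycles per instruction, 64 x 64 iterations, about 100 instructions per iteration) come from the slide; the function names are illustrative only.

```c
/* Instructions executed per second for a given clock and cycles per instruction. */
unsigned long InstructionsPerSecond(unsigned long clockHz, unsigned cyclesPerInstr) {
    return clockHz / cyclesPerInstr;
}

/* Instructions needed to read one image with the nested capture loops. */
unsigned long CaptureInstructions(unsigned rows, unsigned cols, unsigned instrPerIter) {
    return (unsigned long)rows * cols * instrPerIter;
}

/* Whole-number percentage of a one-second budget consumed (rounded down). */
unsigned BudgetPercent(unsigned long instr, unsigned long instrPerSec) {
    return (unsigned)(instr * 100UL / instrPerSec);
}
```

Calling InstructionsPerSecond(12000000UL, 12) gives 1,000,000, CaptureInstructions(64, 64, 100) gives 409,600, and BudgetPercent on those two values gives 40, leaving roughly 60% of the second for the DCT and Huffman encoding.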
Microcontroller
- Synthesizable version of Intel 8051 available
- Written in VHDL
- Captured at register transfer level (RTL)
- Fetches instruction from ROM
- Decodes using Instruction Decoder
- ALU executes arithmetic operations
- Source and destination registers reside in RAM
- Special data movement instructions used to load and store externally
- Special program generates VHDL description of ROM from output of C compiler/linker
[Figure: Block diagram of Intel 8051 processor core, showing the Controller, Instruction Decoder, ALU, 4K ROM, and RAM, with a connection to the external memory bus]
UART
- UART in idle mode until invoked
- UART invoked when 8051 executes store instruction with UART's enable register as target address
- Memory-mapped communication between 8051 and all single-purpose processors
- Lower 8-bits of memory address for RAM
- Upper 8-bits of memory address for memory-mapped I/O devices
- Start state transmits 0 indicating start of byte transmission then transitions to Data state
- Data state sends 8 bits serially then transitions to Stop state
- Stop state transmits 1 indicating transmission done then transitions back to idle mode
[Figure: FSMD description of UART. Idle: I = 0; when invoked, go to Start. Start: transmit LOW, go to Data. Data: transmit data(I), then I++; stay in Data while I < 8, go to Stop when I = 8. Stop: transmit HIGH, return to Idle.]
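The UART state machine described above can be sketched in software as follows. This is a behavioral model for illustration, not the VHDL used in the lecture. The type and function names are hypothetical, and two details are assumptions: the line idles HIGH, and data(I) means bit I of the byte with bit 0 the least significant.

```c
/* States of the transmit-only UART. */
typedef enum { UART_IDLE, UART_START, UART_DATA, UART_STOP } UartState;

typedef struct {
    UartState state;
    unsigned char data;  /* byte being transmitted */
    int i;               /* index of the next data bit (the FSMD's I) */
    int txLine;          /* current level of the serial output */
} UartModel;

void UartModelInit(UartModel *u) {
    u->state = UART_IDLE;
    u->data = 0;
    u->i = 0;
    u->txLine = 1;       /* assumption: line idles HIGH */
}

/* Invoke the UART with a byte. Returns 1 if accepted, 0 if it is busy. */
int UartModelInvoke(UartModel *u, unsigned char d) {
    if (u->state != UART_IDLE) return 0;
    u->data = d;
    u->i = 0;            /* Idle state sets I = 0 */
    u->state = UART_START;
    return 1;
}

/* Advance one bit time. Returns 1 if a bit was driven, 0 if idle. */
int UartModelStep(UartModel *u) {
    switch (u->state) {
    case UART_START:                         /* transmit LOW */
        u->txLine = 0;
        u->state = UART_DATA;
        return 1;
    case UART_DATA:                          /* transmit data(I), then I++ */
        u->txLine = (u->data >> u->i) & 1;
        u->i++;
        if (u->i == 8) u->state = UART_STOP;
        return 1;
    case UART_STOP:                          /* transmit HIGH */
        u->txLine = 1;
        u->state = UART_IDLE;
        return 1;
    default:                                 /* idle: nothing to send */
        return 0;
    }
}
```

Each byte therefore occupies ten bit times on the line: one start bit, eight data bits, and one stop bit.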
Connecting SOC components
- Memory-mapped
- All single-purpose processors and RAM are connected to 8051's memory bus
- Read
- Processor places address on 16-bit address bus
- Asserts read control signal for 1 cycle
- Reads data from 8-bit data bus 1 cycle later
- Device (RAM or SPP) detects asserted read control signal
- Checks address
- Places and holds requested data on data bus for 1 cycle
- Write
- Processor places address and data on address and data bus
- Asserts write control signal for 1 clock cycle
- Device (RAM or SPP) detects asserted write control signal
- Checks address bus
- Reads and stores data from data bus
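The address check each device performs can be illustrated with a small software model of the bus. This sketch ignores cycle timing and only shows the decode step. The names BusModel, BusRead and BusWrite are hypothetical, and several details are assumptions: RAM is taken to respond when the upper address byte is zero, the UART registers sit at 65535 (transmit) and 65534 (status), and an address that no device claims reads back as 0xFF.

```c
#define RAM_SIZE 256  /* assumption: RAM addressed by the lower 8 bits */

typedef struct {
    unsigned char ram[RAM_SIZE];
    unsigned char uartTx;    /* models the UART transmit register */
    unsigned char uartStat;  /* models the UART status register */
    unsigned txCount;        /* number of bytes written to the UART */
} BusModel;

/* Write: each device checks the address and stores the data if selected. */
void BusWrite(BusModel *b, unsigned short addr, unsigned char data) {
    if ((addr >> 8) == 0) {
        b->ram[addr & 0xFF] = data;        /* RAM claims the access */
    } else if (addr == 65535u) {
        b->uartTx = data;                  /* UART claims the access */
        b->txCount++;
    }
    /* any other address: no device responds and the write is ignored */
}

/* Read: the selected device places its data on the bus. */
unsigned char BusRead(const BusModel *b, unsigned short addr) {
    if ((addr >> 8) == 0) return b->ram[addr & 0xFF];
    if (addr == 65534u) return b->uartStat;
    return 0xFF;  /* assumption: undriven bus reads as all ones */
}
```

From the processor's point of view, talking to a single-purpose processor is then the same operation as a RAM load or store at a different address.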
Software
- System-level model provides majority of code
- Module hierarchy, procedure names, and main program unchanged
- Code for UART and CCDPP modules must be redesigned
- Simply replace with memory assignments
- xdata used to load/store variables over external memory bus
- at specifies memory address to store these variables
- Byte sent to U_TX_REG by processor will invoke UART
- U_STAT_REG used by UART to indicate it is ready for next byte
- UART may be much slower than processor
- Similar modification for CCDPP code
- All other modules untouched
Rewritten UART module:
static unsigned char xdata U_TX_REG at 65535;
static unsigned char xdata U_STAT_REG at 65534;

void UartInitialize(void) {}

void UartSend(unsigned char d) {
    while( U_STAT_REG == 1 ) {
        /* busy wait */
    }
    U_TX_REG = d;
}

Original code from system-level model:
#include <stdio.h>
static FILE *outputFileHandle;

void UartInitialize(const char *outputFileName) {
    outputFileHandle = fopen(outputFileName, "w");
}

void UartSend(char d) {
    fprintf(outputFileHandle, "%i\n", (int)d);
}