Analysis - High Performance Embedded Systems Design - Lecture Slides, Slides of Computer Science

These lecture slides for High Performance Embedded Systems Design cover performance analysis through a digital camera case study. Key points: performance analysis, digital camera case study, simple digital camera, requirements specification, design, four implementations, custom, standard, memory.

Typology: Slides

2012/2013

Uploaded on 03/22/2013

dhritiman
6-Performance Analysis of

Embedded System Designs:

Digital Camera Case Study (cont.)

Outline

• Introduction to a simple digital camera

• Designer’s perspective

• Requirements specification

• Design

– Four implementations

Introduction to a simple digital camera

  • Captures images
  • Stores images in digital format
    • No film
    • Multiple images stored in camera
      • Number depends on amount of memory and bits used per image
  • Downloads images to PC
  • Only recently possible
    • Systems-on-a-chip
      • Multiple processors and memories on one IC
    • High-capacity flash memory
  • Very simple description used for example
    • Many more features with real digital camera
      • Variable size images, image deletion, digital stretching, zooming in and out, etc.
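The dependence of image count on memory size and bits per image can be made concrete with a little arithmetic. The sketch below is illustrative only: `images_that_fit` is a hypothetical helper, it ignores compression and file-system overhead, and the memory size used in the usage note is an assumed figure, not one taken from the slides.

```c
/* Hypothetical helper: how many uncompressed images fit in a given memory.
 * Ignores compression and any bookkeeping overhead (assumption). */
unsigned long images_that_fit(unsigned long memory_bytes,
                              unsigned long width,
                              unsigned long height,
                              unsigned long bits_per_pixel)
{
    unsigned long bits_per_image = width * height * bits_per_pixel;
    unsigned long bytes_per_image = (bits_per_image + 7) / 8; /* round up to whole bytes */

    if (bytes_per_image == 0) {
        return 0; /* degenerate image size: nothing meaningful to store */
    }
    return memory_bytes / bytes_per_image;
}
```

For example, with the 64 x 64 image size used in the code later in these slides and 8 bits per pixel, an assumed 64 KB memory holds `images_that_fit(65536, 64, 64, 8)` = 16 images; doubling the bits per pixel halves that count.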

Designer’s perspective

  • Two key tasks
    • Processing images and storing in memory
      • When shutter pressed:
        • Image captured
        • Converted to digital form by charge-coupled device (CCD)
        • Compressed and archived in internal memory
    • Uploading images to PC
      • Digital camera attached to PC
      • Special software commands camera to transmit archived images serially

CCDPP (CCD PreProcessing) module

  • Performs zero-bias adjustment
  • CcdppCapture uses CcdCapture and CcdPopPixel to obtain image
  • Performs zero-bias adjustment after each row read in

#define SZ_ROW 64
#define SZ_COL 64

static char buffer[SZ_ROW][SZ_COL];
static unsigned rowIndex, colIndex;

/* -1 wraps to the largest unsigned value and marks "no image available" */
void CcdppInitialize() {
    rowIndex = -1;
    colIndex = -1;
}

void CcdppCapture(void) {
    char bias;
    CcdCapture();
    for(rowIndex=0; rowIndex<SZ_ROW; rowIndex++) {
        /* read one row of pixels from the CCD module */
        for(colIndex=0; colIndex<SZ_COL; colIndex++) {
            buffer[rowIndex][colIndex] = CcdPopPixel();
        }
        /* average the two values that follow the row to get its bias */
        bias = (CcdPopPixel() + CcdPopPixel()) / 2;
        /* zero-bias adjustment for the row just read */
        for(colIndex=0; colIndex<SZ_COL; colIndex++) {
            buffer[rowIndex][colIndex] -= bias;
        }
    }
    /* rewind so CcdppPopPixel starts at the first pixel */
    rowIndex = 0;
    colIndex = 0;
}

char CcdppPopPixel(void) {
    char pixel;
    pixel = buffer[rowIndex][colIndex];
    if( ++colIndex == SZ_COL ) {
        colIndex = 0;
        if( ++rowIndex == SZ_ROW ) {
            /* whole image consumed */
            colIndex = -1;
            rowIndex = -1;
        }
    }
    return pixel;
}
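The per-row zero-bias adjustment can be tried in isolation, without the CCD module. The following is a minimal sketch under stated assumptions: `zero_bias_adjust_row` is a hypothetical helper (not part of the slides' code), the two reference readings are passed in as arguments rather than popped from `CcdPopPixel`, and `signed char` is used so the arithmetic does not depend on whether plain `char` is signed on the host.

```c
#include <stddef.h>

/* Hypothetical helper: subtract a row's bias estimate from every pixel.
 * ref_a and ref_b stand for the two extra readings that follow the row. */
void zero_bias_adjust_row(signed char *row, size_t n,
                          signed char ref_a, signed char ref_b)
{
    /* bias is the integer average of the two reference readings */
    signed char bias = (signed char)((ref_a + ref_b) / 2);

    for (size_t i = 0; i < n; i++) {
        row[i] = (signed char)(row[i] - bias);
    }
}
```

With a row of {10, 12, 14, 16} and reference readings 2 and 4, the bias is 3 and the row becomes {7, 9, 11, 13}.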

UART module

• Actually a half UART

– Only transmits, does not receive

• UartInitialize is passed name of file to output to

• UartSend transmits (writes to output file) bytes at a time

#include <stdio.h>

static FILE *outputFileHandle;

void UartInitialize(const char *outputFileName) {
    outputFileHandle = fopen(outputFileName, "w");
}

void UartSend(char d) {
    /* each byte is written as a decimal integer on its own line */
    fprintf(outputFileHandle, "%i\n", (int)d);
}

CODEC module

  • Models FDCT encoding
  • ibuffer holds original 8 x 8 block
  • obuffer holds encoded 8 x 8 block
  • CodecPushPixel called 64 times to fill ibuffer with original block
  • CodecDoFdct called once to transform 8 x 8 block
    • Explained in next slide
  • CodecPopPixel called 64 times to retrieve encoded block from obuffer

static short ibuffer[8][8], obuffer[8][8], idx;

void CodecInitialize(void) { idx = 0; }

void CodecDoFdct(void) {
    int x, y;
    for(x=0; x<8; x++) {
        for(y=0; y<8; y++)
            obuffer[x][y] = FDCT(x, y, ibuffer);
    }
    idx = 0;
}

void CodecPushPixel(short p) {
    if( idx == 64 ) idx = 0;
    ibuffer[idx / 8][idx % 8] = p;
    idx++;
}

short CodecPopPixel(void) {
    short p;
    if( idx == 64 ) idx = 0;
    p = obuffer[idx / 8][idx % 8];
    idx++;
    return p;
}
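The slide that defines `FDCT` is not part of this excerpt. As a rough illustration only, the sketch below computes one coefficient of the textbook 8 x 8 forward DCT, F(u,v) = 1/4 C(u) C(v) ΣΣ D(x,y) cos((2x+1)uπ/16) cos((2y+1)vπ/16), with C(0) = 1/√2 and C(k) = 1 otherwise. It is an assumption that the course's `FDCT` follows this formula; `fdct_coeff`, `cos16`, and `COS16` are hypothetical names, and floating point is used for clarity where an embedded version would more likely use a fixed-point table.

```c
/* cos(k*pi/16) for k = 0..8, hardcoded so the sketch needs no math library */
static const double COS16[9] = {
    1.0, 0.9807852804, 0.9238795325, 0.8314696123, 0.7071067812,
    0.5555702330, 0.3826834324, 0.1950903220, 0.0
};

/* cos(k*pi/16) for any k >= 0, using the symmetries of cosine */
static double cos16(int k)
{
    k %= 32;                 /* one full period is 32 steps of pi/16 */
    if (k > 16) {
        k = 32 - k;          /* cos(2*pi - t) = cos(t) */
    }
    if (k > 8) {
        return -COS16[16 - k]; /* cos(pi - t) = -cos(t) */
    }
    return COS16[k];
}

/* Hypothetical stand-in for FDCT(u, v, img): one coefficient of the
 * textbook 8 x 8 forward DCT of the block img. */
double fdct_coeff(int u, int v, short img[8][8])
{
    double cu = (u == 0) ? COS16[4] : 1.0; /* C(0) = 1/sqrt(2) = cos(pi/4) */
    double cv = (v == 0) ? COS16[4] : 1.0;
    double sum = 0.0;

    for (int x = 0; x < 8; x++) {
        for (int y = 0; y < 8; y++) {
            sum += img[x][y] * cos16((2 * x + 1) * u) * cos16((2 * y + 1) * v);
        }
    }
    return 0.25 * cu * cv * sum;
}
```

A block in which every pixel has the same value v yields a DC coefficient of about 8v at (0,0) and values near zero everywhere else, which is why smooth image regions compress well after this transform.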

CNTRL (controller) module

  • Heart of the system
  • CntrlInitialize for consistency with other modules only
  • CntrlCaptureImage uses CCDPP module to input image and place in buffer
  • CntrlCompressImage breaks the 64 x 64 buffer into 8 x 8 blocks and performs FDCT on each block using the CODEC module
    • Also performs quantization on each block
  • CntrlSendImage transmits encoded image serially using UART module

#define SZ_ROW 64
#define SZ_COL 64
#define NUM_ROW_BLOCKS (SZ_ROW / 8)
#define NUM_COL_BLOCKS (SZ_COL / 8)

static short buffer[SZ_ROW][SZ_COL], i, j, k, l, temp;

void CntrlInitialize(void) {}

void CntrlCaptureImage(void) {
    CcdppCapture();
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++)
            buffer[i][j] = CcdppPopPixel();
}

void CntrlCompressImage(void) {
    for(i=0; i<NUM_ROW_BLOCKS; i++)
        for(j=0; j<NUM_COL_BLOCKS; j++) {
            /* push one 8 x 8 block into the CODEC */
            for(k=0; k<8; k++)
                for(l=0; l<8; l++)
                    CodecPushPixel( (char)buffer[i * 8 + k][j * 8 + l]);
            CodecDoFdct(); /* part 1 - FDCT */
            for(k=0; k<8; k++)
                for(l=0; l<8; l++) {
                    buffer[i * 8 + k][j * 8 + l] = CodecPopPixel();
                    /* part 2 - quantization */
                    buffer[i * 8 + k][j * 8 + l] >>= 6;
                }
        }
}

void CntrlSendImage(void) {
    for(i=0; i<SZ_ROW; i++)
        for(j=0; j<SZ_COL; j++) {
            temp = buffer[i][j];
            UartSend(((char*)&temp)[0]); /* send upper byte */
            UartSend(((char*)&temp)[1]); /* send lower byte */
        }
}
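CntrlSendImage picks the two bytes of `temp` by indexing its in-memory representation, so which byte counts as "upper" depends on the target's byte order (the comments match a big-endian target). A minimal sketch of a shift-based alternative that gives the same byte order on any host follows; `split_upper_lower` is a hypothetical helper, not part of the slides' code, and it assumes a 16-bit `short`.

```c
/* Hypothetical helper: split a 16-bit value into upper and lower bytes
 * with shifts, so the result does not depend on host endianness. */
void split_upper_lower(short value, unsigned char *upper, unsigned char *lower)
{
    unsigned short v = (unsigned short)value; /* avoid shifting a negative value */

    *upper = (unsigned char)(v >> 8);
    *lower = (unsigned char)(v & 0xFF);
}
```

Sending `upper` first and `lower` second then produces the order the comments describe, regardless of the processor the code is compiled for.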

Design

  • Determine system’s architecture
    • Processors
      • Any combination of single-purpose (custom or standard) or general-purpose processors
    • Memories, buses
  • Map functionality to that architecture
    • Multiple functions on one processor
    • One function on one or more processors
  • Implementation
    • A particular architecture and mapping
    • Solution space is set of all implementations
  • Starting point
    • Low-end general-purpose processor connected to flash memory
      • All functionality mapped to software running on processor
      • Usually satisfies power, size, and time-to-market constraints
      • If timing constraint not satisfied then later implementations could:
        • use single-purpose processors for time-critical functions
        • rewrite functional specification

Implementation 1: Microcontroller alone

• Low-end processor could be Intel 8051 microcontroller

• Total IC cost including NRE about $

• Well below 200 mW power

• Time-to-market about 3 months

• However, one image per second not possible

  • 12 MHz, 12 cycles per instruction
    • Executes one million instructions per second
  • CcdppCapture has nested loops resulting in 4096 (64 x 64) iterations
    • ~100 assembly instructions each iteration
    • 409,600 (4096 x 100) instructions per image
    • Roughly 40% of the one-million-instruction budget for reading image alone
  • Would be over budget after adding compute-intensive DCT and Huffman encoding
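The budget estimate above is plain arithmetic and can be reproduced directly. The sketch below uses only the figures quoted on the slide (12 MHz clock, 12 cycles per instruction, 64 x 64 iterations, about 100 instructions per iteration); the function names are hypothetical and the 100-instruction figure is the slide's own approximation.

```c
/* Instructions per second for a processor with a fixed cycle count
 * per instruction. */
unsigned long instructions_per_second(unsigned long clock_hz,
                                      unsigned long cycles_per_instr)
{
    return clock_hz / cycles_per_instr;
}

/* Instructions spent in a two-level loop nest. */
unsigned long loop_instructions(unsigned long rows, unsigned long cols,
                                unsigned long instr_per_iter)
{
    return rows * cols * instr_per_iter;
}

/* Share of an instruction budget that is used, in whole percent
 * (rounded down). */
unsigned budget_percent(unsigned long used, unsigned long budget)
{
    return (unsigned)(used * 100UL / budget);
}
```

With the slide's numbers, the 8051 executes 1,000,000 instructions per second, CcdppCapture costs about 409,600 of them, and that is 40% of a one-second budget before any DCT or Huffman encoding is added.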

Microcontroller

  • Synthesizable version of Intel 8051 available
    • Written in VHDL
    • Captured at register transfer level (RTL)
  • Fetches instruction from ROM
  • Decodes using Instruction Decoder
  • ALU executes arithmetic operations
    • Source and destination registers reside in RAM
  • Special data movement instructions used to load and store externally
  • Special program generates VHDL description of ROM from output of C compiler/linker

[Figure: Block diagram of Intel 8051 processor core, showing the Controller, 4K ROM, RAM, Instruction Decoder, and ALU, with a connection to the external memory bus]

UART

  • UART in idle mode until invoked
    • UART invoked when 8051 executes store instruction with UART’s enable register as target address
  • Memory-mapped communication between 8051 and all single-purpose processors
  • Lower 8-bits of memory address for RAM
  • Upper 8-bits of memory address for memory-mapped I/O devices
  • Start state transmits 0 indicating start of byte transmission then transitions to Data state
  • Data state sends 8 bits serially then transitions to Stop state
  • Stop state transmits 1 indicating transmission done then transitions back to idle mode

[Figure: FSMD description of UART. Idle (I = 0) moves to Start when invoked; Start transmits LOW and moves to Data; Data transmits data(I), then I++, repeating while I < 8; when I = 8, Stop transmits HIGH and returns to Idle.]
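The Idle, Start, Data, and Stop states of the UART can be modeled in software as a small state machine. The sketch below is an illustration under assumptions, not the VHDL used in the course: all names are hypothetical, one call to `uart_step` stands for one bit time, the idle line is assumed HIGH, and bits are assumed to go out least significant first (the slides only say data(I) is sent for I = 0 to 7).

```c
typedef enum { UART_IDLE, UART_START, UART_DATA, UART_STOP } UartState;

typedef struct {
    UartState state;
    unsigned char data; /* byte being transmitted */
    int i;              /* index of the next data bit (the FSMD's I) */
    int tx;             /* level on the transmit line: 0 = LOW, 1 = HIGH */
} UartFsmd;

void uart_reset(UartFsmd *u)
{
    u->state = UART_IDLE;
    u->data = 0;
    u->i = 0;
    u->tx = 1; /* assumption: line idles HIGH */
}

/* Models the store to the enable register that invokes the UART. */
void uart_invoke(UartFsmd *u, unsigned char byte)
{
    if (u->state == UART_IDLE) {
        u->data = byte;
        u->i = 0;              /* Idle's action: I = 0 */
        u->state = UART_START;
    }
}

/* Advance one bit time; returns the level driven on the line. */
int uart_step(UartFsmd *u)
{
    switch (u->state) {
    case UART_IDLE:
        u->tx = 1;
        break;
    case UART_START:
        u->tx = 0;             /* start bit: transmit LOW */
        u->state = UART_DATA;
        break;
    case UART_DATA:
        u->tx = (u->data >> u->i) & 1; /* transmit data(I) */
        u->i++;
        if (u->i == 8) {
            u->state = UART_STOP;
        }
        break;
    case UART_STOP:
        u->tx = 1;             /* stop bit: transmit HIGH */
        u->state = UART_IDLE;
        break;
    }
    return u->tx;
}
```

After `uart_invoke`, ten calls to `uart_step` produce the start bit, eight data bits, and the stop bit, and the machine is back in the idle state ready for the next byte.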

Connecting SOC components

  • Memory-mapped
    • All single-purpose processors and RAM are connected to 8051’s memory bus
  • Read
    • Processor places address on 16-bit address bus
    • Asserts read control signal for 1 cycle
    • Reads data from 8-bit data bus 1 cycle later
    • Device (RAM or SPP) detects asserted read control signal
    • Checks address
    • Places and holds requested data on data bus for 1 cycle
  • Write
    • Processor places address and data on address and data bus
    • Asserts write control signal for 1 clock cycle
    • Device (RAM or SPP) detects asserted write control signal
    • Checks address bus
    • Reads and stores data from data bus
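The address check each device performs on a read or write can be sketched as a simple decoder. This is a host-side model under assumptions: the `Bus` type and function names are hypothetical, addresses whose upper byte is zero are taken to select RAM (one reading of the lower-8-bit/upper-8-bit split described earlier), a single device register is placed at 65535 (the address the rewritten UART code uses for U_TX_REG), and an unclaimed read is assumed to return all ones.

```c
#define RAM_SIZE 256u
#define UART_TX_ADDR 0xFFFFu /* 65535 */

typedef struct {
    unsigned char ram[RAM_SIZE]; /* selected when the upper address byte is 0 */
    unsigned char uart_tx;       /* hypothetical memory-mapped device register */
    unsigned uart_writes;        /* how many times the device was invoked */
} Bus;

/* Write: each device checks the address and stores the data if selected. */
void bus_write(Bus *b, unsigned short addr, unsigned char data)
{
    if ((addr >> 8) == 0) {
        b->ram[addr & 0xFF] = data;
    } else if (addr == UART_TX_ADDR) {
        b->uart_tx = data;
        b->uart_writes++;
    }
    /* any other address: no device responds */
}

/* Read: the selected device places its data on the data bus. */
unsigned char bus_read(const Bus *b, unsigned short addr)
{
    if ((addr >> 8) == 0) {
        return b->ram[addr & 0xFF];
    }
    if (addr == UART_TX_ADDR) {
        return b->uart_tx;
    }
    return 0xFF; /* assumption: an undriven bus reads as all ones */
}
```

A store to 0x0010 lands in RAM, while a store to 65535 reaches the device register without touching RAM, which is the behavior the processor relies on when it treats single-purpose processors as memory locations.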

Software

  • System-level model provides majority of code
    • Module hierarchy, procedure names, and main program unchanged
  • Code for UART and CCDPP modules must be redesigned
    • Simply replace with memory assignments
      • xdata used to load/store variables over external memory bus
      • at specifies memory address to store these variables
      • Byte sent to U_TX_REG by processor will invoke UART
      • U_STAT_REG used by UART to indicate it's ready for next byte
        • UART may be much slower than processor
    • Similar modification for CCDPP code
  • All other modules untouched

Rewritten UART module:

static unsigned char xdata U_TX_REG at 65535;
static unsigned char xdata U_STAT_REG at 65534;

void UARTInitialize(void) {}

void UARTSend(unsigned char d) {
    while( U_STAT_REG == 1 ) {
        /* busy wait */
    }
    U_TX_REG = d;
}

Original code from system-level model:

#include <stdio.h>

static FILE *outputFileHandle;

void UartInitialize(const char *outputFileName) {
    outputFileHandle = fopen(outputFileName, "w");
}

void UartSend(char d) {
    fprintf(outputFileHandle, "%i\n", (int)d);
}
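The busy-wait handshake of the rewritten UART module can be exercised on a host by replacing the fixed `xdata` addresses with a pointer to a register block. The sketch below is an assumption-laden variant, not the course's code: `UartRegs` and `uart_send_polled` are hypothetical, `volatile` stands in for the effect of the memory-mapped declarations (each poll really rereads the status), and the poll limit is an addition so the loop cannot spin forever if the device never becomes ready.

```c
/* Hypothetical register block standing in for the memory-mapped UART. */
typedef struct {
    volatile unsigned char tx;   /* plays the role of U_TX_REG */
    volatile unsigned char stat; /* plays the role of U_STAT_REG: 1 = busy */
} UartRegs;

/* Wait while the UART is busy, then hand it the next byte.
 * Returns 1 if the byte was written, 0 if the UART stayed busy
 * for more than max_polls status reads. */
int uart_send_polled(UartRegs *regs, unsigned char d, unsigned long max_polls)
{
    while (regs->stat == 1) {
        if (max_polls-- == 0) {
            return 0; /* give up instead of spinning forever */
        }
    }
    regs->tx = d; /* on the real system this store invokes the UART */
    return 1;
}
```

On the 8051 the processor is much faster than the UART, so most calls spend their time in the polling loop; the return value lets a caller decide what to do if the device never clears its busy flag.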