Introduction to Digital Signal Processors
Prof. Brian L. Evans, in collaboration with Niranjan Damera-Venkata and Magesh Valliappan
Embedded Signal Processing Laboratory, The University of Texas at Austin, Austin, TX 78712-1084
http://signal.ece.utexas.edu/
[Figure labels: accumulator architecture, memory-register architecture, load-store architecture]
Slide 2: Outline
  Signal processing applications
  Conventional DSP architecture
  Pipelining in DSP processors
  RISC vs. DSP processor architectures
  TI TMS320C6x VLIW DSP architecture
  Signal and image processing applications
  Signal processing on general-purpose processors
  Conclusion
Slide 3: Signal Processing Applications
  Low-cost embedded systems
    Modems, cellular telephones, disk drives, printers
  High-throughput applications
    Halftoning, base stations, 3-D sonar, tomography
  PC-based multimedia
    Compression/decompression of audio, graphics, video
  Embedded processor requirements
    Inexpensive with small area and volume
    Deterministic interrupt service routine latency
    Low power: ~50 mW (TMS320C54x uses 0.36 mA/MIP)
Slide 4: Conventional DSP Architecture
  Harvard architecture
    Separate data memory/bus and program memory/bus
    Three reads and one or two writes per instruction cycle
  Deterministic interrupt service routine latency
  Multiply-accumulate in a single instruction cycle
  Special addressing modes supported in hardware
    Modulo addressing for circular buffers (e.g. FIR filters)
    Bit-reversed addressing (e.g. fast Fourier transforms)
  Instructions to keep the pipeline (3-4 stages) full
    Zero-overhead looping (one pipeline flush to set up)
    Delayed branches
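The bit-reversed addressing mode listed above can be illustrated in software. The following is a minimal Python sketch, not the hardware mechanism itself: the function names `bit_reverse` and `bit_reversed_order` are hypothetical, and the block length is assumed to be a power of two, as in a radix-2 FFT.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the next low bit
        index >>= 1
    return result


def bit_reversed_order(n):
    """List the indices 0..n-1 in bit-reversed order (n must be a power of two)."""
    bits = n.bit_length() - 1
    if n < 1 or (1 << bits) != n:
        raise ValueError("n must be a power of two")
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

A DSP with bit-reversed addressing produces this index sequence in its address generation unit, so the reordering costs no extra instructions; the loop above only shows which addresses are visited.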
Slide 5: Conventional DSP Architecture (con't)
  Data shifting and modulo addressing for implementing circular buffers and delay lines
  [Figure: data shifting. Columns: time, buffer contents, next sample. At n=N the buffer holds x(N-K+1) ... x(N) and the next sample is x(N+1); at n=N+1 it holds x(N-K+2) ... x(N+1); at n=N+2 it holds x(N-K+3) ... x(N+2). Every stored sample moves one position per input.]
  [Figure: modulo addressing. Columns: time, buffer contents, next sample. The stored samples stay in place; each new sample overwrites the oldest one, so x(N+1) replaces x(N-K+1) and x(N+2) replaces x(N-K+2).]
  Bit-reversed addressing is used to implement the radix-2 FFT
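The two delay-line strategies on this slide can be compared with a short Python sketch. The class names `ShiftDelayLine` and `CircularDelayLine` are hypothetical and exist only for illustration; a DSP performs the modulo update in its address hardware rather than with an explicit `%` operation.

```python
class ShiftDelayLine:
    """Delay line by data shifting: every push moves K-1 stored samples."""

    def __init__(self, k):
        self.buf = [0] * k

    def push(self, sample):
        # Drop the oldest sample and append the newest at the end.
        self.buf = self.buf[1:] + [sample]

    def newest_first(self):
        return self.buf[::-1]


class CircularDelayLine:
    """Delay line by modulo addressing: every push writes one location."""

    def __init__(self, k):
        self.buf = [0] * k
        self.oldest = 0  # index of the oldest sample, the next one to overwrite

    def push(self, sample):
        self.buf[self.oldest] = sample
        self.oldest = (self.oldest + 1) % len(self.buf)  # wrap around

    def newest_first(self):
        k = len(self.buf)
        return [self.buf[(self.oldest - 1 - i) % k] for i in range(k)]


shift, circ = ShiftDelayLine(4), CircularDelayLine(4)
for x in range(1, 7):
    shift.push(x)
    circ.push(x)
print(shift.newest_first(), circ.newest_first())
```

Both objects hold the same K most recent samples; the difference is the work per input, which is why hardware modulo addressing matters for long filters.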
Slide 6: Conventional DSP Architecture (con't)
Slide 7: Conventional DSP Architecture (con't)
  Market share: 95% fixed-point, 5% floating-point
  Each processor family has many members with different on-chip configurations
    Size and map of data and program memory
    A/D, input/output buffers, interfaces, timers, and D/A
  Drawbacks to conventional DSP processors
    No byte addressing (needed for image and video)
    Limited on-chip memory
    Limited addressable memory on fixed-point DSPs, except Motorola 56300 (16 Mw data; 64 Mw program)
    Non-standard C extensions to support fixed-point data
Slide 8: Pipelining
  Stages: Fetch, Decode, Read, Execute
  Sequential (Motorola 56000)
  Pipelined (most conventional DSP processors)
  Superscalar (Pentium, MIPS)
  Superpipelined (CDC7600)
  [Figure: Fetch/Decode/Read/Execute timing diagram for each of the four organizations]
  Managing pipelines
    Compiler or programmer
    Pipeline interlocking in the processor
    Hardware instruction scheduling
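The benefit of overlapping the Fetch, Decode, Read, and Execute stages can be estimated with a back-of-the-envelope model. The sketch below assumes an idealized machine in which every stage takes one cycle and no hazards or stalls occur; the function names are hypothetical and the numbers are not measurements of any listed processor.

```python
def sequential_cycles(instructions, stages=4):
    """Cycles when each instruction finishes all stages before the next starts."""
    return instructions * stages


def pipelined_cycles(instructions, stages=4):
    """Cycles when a new instruction enters the pipeline every cycle."""
    if instructions == 0:
        return 0
    # The first instruction needs `stages` cycles; each later one adds a cycle.
    return stages + instructions - 1


n = 12
print(sequential_cycles(n), pipelined_cycles(n))  # 48 15
```

For long instruction streams the ratio approaches the number of stages, which is why keeping the pipeline full is a recurring theme in the following slides.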
Slide 9: Pipelining: Operation
  Time-stationary pipeline model
    Programmer controls each cycle
    Example, Motorola DSP56001:
      MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0
  Data-stationary pipeline model
    Programmer specifies data operations
    Example, TMS320C30/40:
      MPYF *++AR0(1),*++AR1(IR0),R0
  Interlocked pipeline
    Programmer is "protected" from pipeline effects
  [Figure: pipeline occupancy chart showing instructions A through L moving through the Fetch, Decode, Read, and Execute stages over time, with stall slots marked "-"]
Slide 10: Pipelining: Hazards
  A control hazard occurs when a branch instruction is decoded
    "Flush" the pipeline, or
    Delayed branch (expose pipeline)
  A data hazard occurs because an operand cannot be read yet
    Intended by programmer, or
    Interlock hardware inserts "bubble"
  [Figure: pipeline occupancy chart (Fetch, Decode, Read, Execute) showing a branch "br" following instructions A through F, with empty slots "-" before the instructions X, Y, Z]
  TMS320C5x example:
      LAC  #064h
      SAMM AR2
      NOP
      LACC *-

      LAR  AR2, DATA
      LACC *-
Slide 11: Pipelining: Avoiding Control Hazards
  A repeat instruction repeats one instruction or a block of instructions after the repeat
  The pipeline is filled with the repeated instruction (or block of instructions)
  Cost: one pipeline flush only
  A key factor in the numeric performance of DSPs is the provision of special hardware to perform looping.
  [Figure: pipeline occupancy chart (Fetch, Decode, Read, Execute) showing "rpt" following instructions A through F, after which the repeated instruction X fills the pipeline]
  Example:
      RPT  COUNT
      TBLR *+
Slide 12: RISC vs. DSP: Instruction Encoding
  RISC: superscalar
    [Figure blocks: Reorder, Load/store, FP Unit, Integer Unit]
  DSP: horizontal microcode
    [Figure blocks: Load/store, Load/store, Address, ALU, Multiplier]
Slide 13: RISC vs. DSP: Memory Hierarchy
  RISC
    [Figure blocks: Registers, I/D Cache, Physical memory, Out of order, TLB]
  DSP
    [Figure blocks: Registers, I Cache, Internal memories, External memories, DMA Controller]
  TLB: Translation Lookaside Buffer
  DMA: Direct Memory Access
Slide 14: TI TMS320C6x VLIW DSP Architecture
  Simplified architecture
  [Block diagram labels]
    Program RAM; Data RAM or Cache
    Internal buses (Addr, Data)
    CPU: Regs (A0-A15) with units .D1, .M1, .L1, .S1; Regs (B0-B15) with units .D2, .M2, .L2, .S2; Control Regs
    External memory (Sync, Async)
    DMA, Serial Port, Host Port, Boot Load, Timers, Pwr Down
Slide 15: TI TMS320C6x VLIW DSP Architecture
  Two parallel data paths with single-cycle units:
    Data unit - 32-bit address calculations (modulo, linear)
    Multiplier unit - 16 bit x 16 bit with 32-bit result
    Logical unit - 40-bit (saturation) arithmetic & compares
    Shifter unit - 32-bit integer ALU and 40-bit shifter
  16 32-bit registers in each data path
    40 bits can be stored in adjacent even/odd registers
  Fixed-point (C62x) and floating-point (C67x)
  TMS320C6201: $25 in volume
    150 MHz, 300 million MACs/sec, 1200 RISC MIPS
    On-chip memory: 16 k x 32 program, 32 k x 16 data
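The word lengths quoted above (16 bit x 16 bit products, 40-bit saturating arithmetic) can be made concrete with a small Python sketch. It only illustrates the number ranges involved; the function names `saturate` and `mac` are hypothetical and the code does not model the exact semantics of any C6x instruction.

```python
def saturate(value, bits):
    """Clamp `value` to the range of a signed two's-complement word of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac(acc, a, b, acc_bits=40):
    """Multiply two signed 16-bit operands and add to a saturating accumulator."""
    for operand in (a, b):
        if not -32768 <= operand <= 32767:
            raise ValueError("operands must fit in signed 16 bits")
    product = a * b  # always fits in 32 bits for 16-bit operands
    return saturate(acc + product, acc_bits)


print(saturate(40000, 16))   # 32767
print(mac(0, 32767, 32767))  # 1073676289
```

With saturation, an overflowing sum sticks at the largest representable value instead of wrapping to the opposite sign, which is usually the less harmful error in a signal path. The 8 guard bits of a 40-bit accumulator allow many 32-bit products to be summed before saturation occurs.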
Slide 16: TI TMS320C6x VLIW DSP Architecture
  One instruction cycle per clock cycle
  Deep pipeline
    7-11 stages in C62x: fetch 4, decode 2, execute 1-5
    7-16 stages in C67x: fetch 4, decode 2, execute 1-10
    If a branch is in the pipeline, interrupts are disabled (the latency of a branch is 5 cycles)
    Avoid branches by using conditional execution
  No hardware protection against pipeline hazards
    Compiler and assembler must prevent pipeline hazards
  C67x computes a floating-point multiply in 4 cycles
Slide 17: C5x and C6x Addressing Modes
  Immediate: the operand is part of the instruction
  Register: the operand is specified in a register
  Direct: the address of the operand is part of the instruction (added to implied memory page)
  Indirect: the address of the operand is stored in a register

  Mode        TMS320C5x     TMS320C6x
  Immediate   ADD #0FFh     add .L1 -13,A1,A6
  Register    (implied)     add .L1 A7,A6,A7
  Direct      ADD 010h      not supported
  Indirect    ADD *         ldw .D1 *A5++[8],A1
Slide 18: TMS320C6x vs. Pentium MMX
  BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better)
  http://www.bdti.com/bdtimark/results.htm
  http://www.ece.utexas.edu/~bevans/courses/ee382c/addresses/processors.html
Slide 19: Application: FIR Filter
  [Figure: tapped delay line built from a chain of z^-1 delay elements]
  Each tap requires
    Fetching one data sample
    Fetching one operand
    Multiplying two numbers
    Accumulating the multiplication result
    Shifting one sample in the delay line
  Computing an FIR tap in one instruction cycle
    Three data memory accesses
    Auto-increment or decrement addressing modes
    Modulo addressing to implement the delay line as a circular buffer
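The per-tap work listed above (fetch a sample, fetch a coefficient, multiply, accumulate, advance the delay line) can be sketched in Python as follows. The function name `fir_filter` is hypothetical, and the explicit `% k` stands in for the modulo addressing that a DSP performs in hardware at no instruction cost.

```python
def fir_filter(coeffs, samples):
    """Compute y[n] = sum_i coeffs[i] * x[n - i] using a circular delay line."""
    k = len(coeffs)
    delay = [0] * k  # circular buffer holding the K most recent inputs
    newest = 0       # index of the most recent input sample
    output = []
    for x in samples:
        newest = (newest + 1) % k  # advance the write pointer with wraparound
        delay[newest] = x          # overwrite the oldest sample
        acc = 0
        idx = newest
        for h in coeffs:           # one multiply-accumulate per tap
            acc += h * delay[idx]
            idx = (idx - 1) % k    # step back to the next older sample
        output.append(acc)
    return output


# Feeding a unit impulse returns the coefficients themselves.
print(fir_filter([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

On a conventional DSP the body of the inner loop maps to a single multiply-accumulate instruction with two auto-updated pointers, so an N-tap filter costs about N cycles per output sample.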
Slide 20: Application: FIR Filter on a TMS320C5x
  [Figure labels: Coefficients, Data]
      COEFFP .set 02000h       ; Program mem address
      X      .set 037Fh        ; Newest data sample
      LASTAP .set 037FH        ; Oldest data sample
      …
      LAR    AR3, #LASTAP      ; Point to oldest sample
      RPT    #127
      MACD   COEFFP, *-        ; Do the thing
      APAC
      SACH   Y,1               ; Store result - note shift
Slide 21: Application: FIR Filter on a TMS320C62x
  [Figure labels: Coefficients, Data]
  Single-cycle loop
      ...
      c7:      ldh  .D1  *A1++, A2   ; Read coefficient
      ||       ldh  .D2  *B1++, B2   ; Read data
      || [B0]  sub  .L2  B0, 1, B0   ; Decrement counter
      || [B0]  B    .S2  c7          ; Branch if not zero
      ||       mpy  .M1x A2, B2, A3  ; Form product
      ||       add  .L1  A4, A3, A4  ; Accumulate result
      ...
Slide 22: Ordered Dithering on a TMS320C62x
  Periodic array of thresholds:
      1/8 5/8 7/8 3/8
      7/8 3/8 1/8 5/8
  Throughput of two cycles
      ; remove next two lines if thresholds in linear array
             MVK    .S1  0x0001,AMR  ; modulo block size 2^2
             MVKH   .S1  0x4000,AMR  ; modulo addr reg B6
      ; initialize A6 and B6
             .trip 100               ; minimum loop count
      dith:  LDB    .D1  *A6++,A4    ; read pixel
      ||     LDB    .D2  *B6++,B4    ; read threshold
      ||     CMPGTU .L1x A4,B4,A1    ; threshold pixel
      ||     ZERO   .S1  A5          ; 0 if <= threshold
        [A1] MVK    .S1  255,A5      ; 255 if > threshold
      ||     STB    .D1  A5,*A6++    ; store result
      ||[B0] SUB    .L2  B0,1,B0     ; decrement counter
      ||[B0] B      .S2  dith        ; branch if not zero
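The thresholding rule in the loop above (output 255 if the pixel exceeds the threshold, otherwise 0, with the threshold pointer wrapping around a block of four entries) can be mirrored in a few lines of Python. This is a sketch for one row of pixels; the function name `ordered_dither` is hypothetical, and scaling the thresholds 1/8, 5/8, 7/8, 3/8 to the 8-bit values 32, 160, 224, 96 is an assumption made for the example.

```python
def ordered_dither(pixels, thresholds):
    """Threshold one row of 8-bit pixels against a periodic threshold array."""
    period = len(thresholds)
    out = []
    for i, pixel in enumerate(pixels):
        threshold = thresholds[i % period]  # modulo addressing of the thresholds
        out.append(255 if pixel > threshold else 0)
    return out


# Assumed 8-bit scaling of the thresholds 1/8, 5/8, 7/8, 3/8.
row_thresholds = [32, 160, 224, 96]
print(ordered_dither([128] * 8, row_thresholds))
# [255, 0, 0, 255, 255, 0, 0, 255]
```

A constant mid-gray input turns on half of the output pixels, which is the intended effect: the local density of white dots tracks the input gray level.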
Slide 23: DSP Cores
  ASIC with:
    Programmable DSP
    RAM
    ROM
    Standard cells
    Codec
    Peripherals
    Gate array
    Microcontroller
Slide 24: DSP on General Purpose Processors
  Multimedia applications on PCs
    Video, audio, graphics and animation
    Repetitive parallel sequences of instructions
  Native signal processing examples
    Sun Visual Instruction Set (UltraSPARC 1/2)
    Intel MMX (Pentium I/II/III)
    Intel Concurrent SIMD-FP (Pentium III)
  Single Instruction Multiple Data (SIMD)
    One instruction acts on multiple data in parallel
    Well-suited for graphics
Slide 25: DSP on General Purpose Processors (con't)
  Programming is considerably harder
    C/C++ compilers do not generate native signal processing code, except that Metrowerks CodeWarrior 5 generates MMX code
    Libraries of routines using native signal processing
    Hand code using in-line assembly for best performance
  Pack/unpack data not aligned on SIMD word boundaries
  50-cycle penalty to switch to MMX; 0 penalty for VIS
  Saturation arithmetic in MMX; not supported in