

The Engine of SOC Design

### DATE 2008: Tutorial A:

**Automatically Realising Embedded Systems From High-Level Functional Models:**

- From Platform-Independent Models to Platform-Specific Implementations for Processor-Centric Design

Grant Martin
Chief Scientist, Tensilica
Monday 10 March 2008: 0930-1800

© 2008. Tensilica Inc.

# Agenda

- Platform-Independent Models for processor-centric design
- Options for implementation
- ASIPs
- Determining the platform specifics
  - Manual methods
  - Automated methods
- Examples and Quality of results



### Platform-Independent Models for processor-centric design

- Usually algorithmic or control code in C, C++, possibly SystemC (e.g. C/C++ with SystemC interfaces) or other C dialect
- Might be generated from higher level specification: e.g. SDL, a UML profile, Mathworks Matlab/Simulink
- Such models are "as generic as they come"
  - Will compile and execute on just about any host platform
  - Usually will compile and execute (possibly poorly) on embedded processors of interest
- How does a design team decide what implementation platform to map this model to?



## Choosing implementation options

- Systems design teams will choose an implementation form based on many criteria
  - Technical:
    - Performance, power, energy consumption, size, weight, heat, programmability, packaging, ...
  - Economics/Business:
    - Cost, risk, design effort, verification effort, field flexibility, price, volume, life-cycle cost, expected iterations/derivatives, ...
- Often these criteria narrow the choice window
  - For example, cost + target design time and risk might dictate an ASSP or ASIC choice
  - But even within this window there are several choices, e.g. custom digital HW, or SW on different kinds of processor
- Performance considerations will usually further narrow the choice
  - Design teams need to consider all the reasonable alternatives to optimise product architectures


## Processor options

- Not as narrow a choice as it might seem
  - Processors may vary by 100 to 1 or more on criteria such as performance and energy consumption for a specific task
- Even for embedded applications, processors range widely in their suitability for specific applications or tasks
  - As long as they have sufficient memory, any processor should be able to run any task
    - Thus all processors have flexibility
    - But specific processor features make one more or less suited to particular tasks

# Processor features

- Rated in order of application-specificity (approximate)
  - Size, number and types of local and system memories
  - Multi-operation instructions (for applications with high unstructured concurrency)
- The following are often found in DSPs in particular
  - SIMD operations (for applications with high structured concurrency – e.g. vectorisation)
  - Low or zero-overhead looping
  - Special function units: multipliers, MACs
  - Rounding and truncation instructions
- Usually found in ASIPs:
  - Highly specialised application-specific instructions


## ASIPs

- Application-Specific Instruction-set Processors
- The highest form of platform-specific implementation for processor-centric design
- Might be manually designed
- However, usually designed through automated tool flows driven by:
  - GUI-based configuration parameters
  - Architecture description languages (ADLs)
- Considerable research into this
  - E.g. Masaharu Imai (Osaka U.): PEAS-I, II, III; ASIPMeister
- Many commercial capabilities



### Architecture-Description Language ASIPs

- Often resulted from research projects
  - E.g. nML (TU Berlin)
  - LISA (RWTH Aachen)
- Research often used to generate software tools only
  - E.g. Chess/Checkers ISS and compilers (IMEC)
  - LISA ISS
- Sometimes commercialised
  - Target Compiler Technologies
  - AXYS (now ARM); LISATek (now CoWare)
- Sometimes entirely commercial
  - E.g. Tensilica Instruction Extension Language (TIE)
    - Used for most of the HW and SW toolchain, but not for all parts of the processor
    - The rest is configured using a GUI configurator









# Determining the platform specifics for ASIPs

### Manual Methods

- Profiling source code execution on target processor
- Finding time-consuming routines
- Finding time-consuming loop nests
- Devising new instructions
- Optimising memory accesses
- Modifying source code, and then going round the loop again

### **Automated Methods**

- Seek to move from source code to ASIP configuration in a single step









# Automated methods for ASIP specification

- Long a research topic
  - For example, the work of Ienne and Pozzi at EPFL has been reported in many papers over the years
    - Concentrated on instruction fusions
- Also available for some commercial extensible processors
  - For example, Tensilica's XPRES tool



## XPRES Compiler Flow

- Profile / Analyze application
  - Identify performance critical functions / loops
  - Collect operation mix, dependence graphs
  - Provide feedback to the user on code transformations that better target the meta-ISA (esp. vectorization)
- Generate sets of ISA extensions
  - Each set implements some dimension of meta-ISA
  - Evaluate each set across all functions / loops
  - Performance and cost estimates allow exploration of large design space
- Collect ISA extensions that together provide maximum performance improvement
  - Each collection of ISA extensions forms an ASIP
  - Generate a family of ASIPs for varying hardware cost budgets
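
The last step above, choosing the collection of extensions that gives the most improvement within a hardware budget, can be pictured as a small optimisation problem. The slides do not describe the algorithm XPRES actually uses, so the following is only a toy model: it treats each candidate extension as having one estimated gain and one estimated cost, assumes gains are additive, and solves a classic 0/1 knapsack. The type `extension_t`, the function `best_gain` and the limit `MAX_BUDGET` are hypothetical names introduced for this sketch.

```c
#include <stddef.h>

/* Hypothetical candidate ISA extension (not an XPRES data structure). */
typedef struct {
    unsigned gain;  /* estimated cycles saved across the profiled loops */
    unsigned cost;  /* estimated hardware cost, arbitrary units */
} extension_t;

#define MAX_BUDGET 1024u  /* assumed upper bound, keeps the table on the stack */

/* 0/1 knapsack: largest total gain achievable with total cost <= budget.
 * Returns 0 if budget exceeds MAX_BUDGET. Assumes gains simply add up,
 * which real extensions sharing hardware would not strictly obey. */
unsigned best_gain(const extension_t *ext, size_t n, unsigned budget)
{
    unsigned best[MAX_BUDGET + 1] = {0};

    if (budget > MAX_BUDGET)
        return 0;

    for (size_t i = 0; i < n; i++) {
        /* Walk the budget downwards so each extension is chosen at most once. */
        for (unsigned b = budget; b >= ext[i].cost; b--) {
            unsigned with = best[b - ext[i].cost] + ext[i].gain;
            if (with > best[b])
                best[b] = with;
            if (b == 0)
                break;  /* only reachable when cost == 0; avoids unsigned wrap */
        }
    }
    return best[budget];
}
```

Calling `best_gain` once per budget value gives one point per budget on a performance-versus-cost curve, which loosely mirrors the idea of a family of ASIPs for varying budgets.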



# TIE Techniques Employed by XPRES Compiler: 3 kinds of parallelism

- Instruction-Level Parallelism: FLIX
  - Bundling independent operations
- Data-Level Parallelism: SIMD / Vector
  - Single instruction, multiple-data operations
- Pipeline Parallelism: Fusion
  - Merging of compound, dependent operations
  - E.g. a multiply-add or a multiply feeding a shift
- All three techniques can be combined
  - E.g. Parallel FLIX execution of Fused-SIMD operations



### Instruction-Level Parallelism (ILP)

- XPRES supports ILP with multiple-issue
  - 32-bit or 64-bit instructions
  - Each instruction can contain 1-15 slots
  - Each slot can contain arbitrary mix of operations
- Implemented in Xtensa using FLIX
  - Similar to VLIW without code size increase
- Xtensa C/C++ compiler exploits FLIX automatically
  - Software pipelining for loops
  - List scheduler for other code
  - Code scheduling and packing in both compiler and assembler




### XPRES: Instruction Parallelism (exploited using FLIX)

- Performance
  - Increased IPC
- Hardware cost
  - Additional register file ports
  - Replicated function units

#### Original C Code

[Listing not recoverable from the source slide]

#### 64-bit 3-issue FLIX

```
format f 64 { s0, s1, s2 }
slot_opcodes s0 { L32I, S32I }
slot_opcodes s1 { SRAI, MULL }
slot_opcodes s2 { ADDI }
```

#### 64-bit 2-issue FLIX

```
format f 64 { s0, s1 }
slot_opcodes s0 { L32I, S32I }
slot_opcodes s1 { ADDI, SRAI, MULL }
```

#### Generated Assembly (2 iterations in 8 cycles, 2x)

```
loop:
    { l8ui a4,a11,12; addi a13,a13,4 }
    { l8ui a2,a9,12; mul16u a3,a3,a2 }
    { l8ui a14,a9,4; srai a12,a14,4 }
```



## Data Parallelism

- XPRES supports data parallelism with SIMD operations
  - Vectors of length 2, 4, 8, and 16 (up to 256 bits)
  - Support for unaligned vector loads and stores
  - Vector C-operators, MIN, MAX, ABS, reductions
  - Vector user-defined operations
- Xtensa C/C++ compiler exploits automatically
  - Automatic loop vectorization
  - Source attributes and compiler options specify aliasing and alignment directives to enable additional vectorization opportunities
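
As a rough illustration of why aliasing and alignment information matters, the sketch below uses only standard C: the C99 `restrict` qualifier to promise that the arrays do not overlap, and C11 `_Alignas` to request aligned storage. The Xtensa-specific attributes and compiler options mentioned above are not shown in the slides and are not reproduced here; `vec_add16`, `vec8x16_t` and the 16-byte alignment are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer type: eight 16-bit lanes with 16-byte alignment
 * (the alignment value is an assumed vector width, not an Xtensa figure). */
typedef struct {
    _Alignas(16) int16_t v[8];
} vec8x16_t;

/* `restrict` tells the compiler that dst, a and b never overlap, so no
 * store to dst can change a later load from a or b. Iterations are then
 * independent, which makes the loop a candidate for vectorisation. */
void vec_add16(int16_t *restrict dst, const int16_t *restrict a,
               const int16_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int16_t)(a[i] + b[i]);
}
```

Without the `restrict` promise a compiler generally has to assume that `dst` may overlap `a` or `b`, and must either keep the loop scalar or guard a vector version with run-time overlap checks.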







## Pipeline Parallelism

- XPRES supports pipeline parallelism with fused operations
  - A fused operation is composed of two or more other operations, plus possibly constant values
  - Latency of the fused operation is usually less than the combined latency of the composing operations
  - Limits on input/output operands, number of composing operations, hardware cost, max latency, etc.
  - Graphical support for manual fused operation generation
- Trade off performance against generality
  - Performance: large fusions with fixed constants
  - Generality: smaller fusions, fewer fixed constants
- Xtensa C/C++ compiler exploits automatically
  - Fused operations automatically replace sequences of composing operations
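
The performance/generality trade-off can be sketched in plain C by modelling two hypothetical fused operations as functions. Both fuse a multiply with a right shift: one keeps the shift amount as an operand, the other bakes in a constant. The names `fused_mul_shr` and `fused_mul_shr4` and the constant 4 are illustrative assumptions, not TIE definitions.

```c
#include <stdint.h>

/* More general fusion: multiply feeding a right shift, with the shift
 * amount supplied as an operand (caller must keep shift below 32). */
static inline uint32_t fused_mul_shr(uint16_t a, uint16_t b, unsigned shift)
{
    return ((uint32_t)a * (uint32_t)b) >> shift;
}

/* More specialised fusion: the shift amount is fixed at 4. In hardware a
 * fixed shift is just wiring, so this form would be expected to be cheaper
 * and faster, but it only matches code that shifts by exactly 4. */
static inline uint32_t fused_mul_shr4(uint16_t a, uint16_t b)
{
    return ((uint32_t)a * (uint32_t)b) >> 4;
}
```

A compiler with the general form available can use it for any shift amount; the specialised form only applies where the constant matches, which is the generality cost noted above.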




#### Original C Code

[Listing not recoverable from the source slide]

- Performance
  - Decrease instruction count
  - Decrease computation latency
- Hardware cost
  - Logic to share function units
  - Register file ports

#### Generated Assembly (1 iteration in 5 cycles, 1.6x)

    loop:
        l8ui a12,a11,0
        l8ui a13,a10,0
        addi.n a10,a10,1
        addi.n a11,a11,1
        fusion.mul16u.srai.s8i.addi a9,a12,a13
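
The C source for this example is not present here. A loop consistent with the opcodes named in the generated assembly (two byte loads, an unsigned 16-bit multiply, a right shift, a byte store and pointer increments) might look like the sketch below. The function name, the array names and the shift amount of 4 (borrowed from the `srai` operand in the FLIX listing) are assumptions, not the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source loop: per element, load two bytes, multiply them,
 * shift the product right by 4 and store the low byte of the result. */
void scale_product(uint8_t *c, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = (uint8_t)((a[i] * b[i]) >> 4);
}
```

Read this way, the fused instruction covers the multiply, shift, store and one pointer update of the loop body, leaving two loads and two pointer increments. The quoted figures are self-consistent: 5 cycles per iteration at 1.6x implies a baseline of 8 cycles per iteration.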

























# Improving the quality of results of automated methods

- Although the ideal is to use unmodified source code, often simple transformations will yield much better results
- Maximise instruction fusion
  - Manual (deep) fusions can sometimes be combined with automated techniques
- Maximise utilisation of multi-operation instructions:
  - Enable software pipelining
  - Enable loop unrolling
  - Fix aliasing
- Remove control flow (if not supported by the automated techniques or the microarchitecture)
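
One common way to remove control flow is to rewrite a short `if` as straight-line arithmetic, so the loop body becomes a single basic block that is easier to software-pipeline, fuse or vectorise. The sketch below shows an upper clamp in both forms; `clamp_branch` and `clamp_select` are hypothetical names, and whether such a rewrite is needed at all depends on the compiler and target.

```c
#include <stdint.h>

/* Branching form: the loop body containing this has two control paths. */
int32_t clamp_branch(int32_t x, int32_t hi)
{
    if (x > hi)
        x = hi;
    return x;
}

/* Branch-free form: build an all-ones or all-zeros mask from the
 * comparison and use it to select between hi and x. */
int32_t clamp_select(int32_t x, int32_t hi)
{
    int32_t mask = -(int32_t)(x > hi);  /* -1 (all ones) if x > hi, else 0 */
    return (hi & mask) | (x & ~mask);
}
```

Both functions return the same value for every input; the second simply contains no branch for the tools to work around.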



### Maximise use of SIMD:

- Expose data-level parallelism and enable vectorisation
- Resolve aliasing
- Arrange strides and gaps so that data can be loaded and stored efficiently as contiguous data
- Choose outer vs. inner loop vectorisation in a loop nest
- Inline function calls where possible
- Align data
- Simplify pointer dereferencing
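
The point about strides and contiguous data can be illustrated with a layout change from an array of structures to a structure of arrays. This is a generic sketch, not taken from the slides: the types `cplx_aos_t` and `cplx_soa_t` and the functions `deinterleave` and `sum_re` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleaved layout: reading only `re` walks memory with a stride of two. */
typedef struct { int16_t re; int16_t im; } cplx_aos_t;

/* Split layout: each field is contiguous, so accesses are unit-stride. */
typedef struct { int16_t *re; int16_t *im; } cplx_soa_t;

/* One-off conversion from interleaved to split storage. */
void deinterleave(const cplx_aos_t *in, cplx_soa_t out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out.re[i] = in[i].re;
        out.im[i] = in[i].im;
    }
}

/* Unit-stride reduction over contiguous data, a natural fit for
 * vector loads followed by a vector reduction. */
int32_t sum_re(const int16_t *re, size_t n)
{
    int32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += re[i];
    return s;
}
```

Whether the conversion pays off depends on how often the data is reused after being split; for a single pass the copy can cost more than it saves.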

# Summary

- ASIPs are a powerful mechanism to go from platform-independent models to highly efficient implementations of software on embedded processors
- Many choices of ASIP specification methods available
  - Both commercial and research
  - Manual and automated
- Combining fine-grained instruction extension with coarse-grained structural adaptation can yield up to a 100 to 1 performance improvement over a general-purpose processor and up to a 10 to 1 energy reduction, in similar or smaller area


### References

- ADLs:
  - Prabhat Mishra and Nikil Dutt, "Processor Modeling and Design Tools", Chapter 8 in Volume I: EDA for IC System Design, Verification and Testing, in the Electronic Design Automation for Integrated Circuits Handbook, edited by Louis Scheffer, Luciano Lavagno and Grant Martin, CRC/Taylor and Francis, 2006.
- ASIPs
  - Akira Kitajima, Makiko Itoh, Jun Sato, Akichika Shiomi, Yoshinori Takeuchi, Masaharu Imai: Effectiveness of the ASIP design system PEAS-III in design of pipelined processors. ASP-DAC 2001: 649-654.
  - Paolo Ienne and Rainer Leupers (editors), Customizable Embedded Processors, Elsevier Morgan Kaufmann, 2006.
  - Matthias Gries and Kurt Keutzer (editors), Building ASIPs: The MESCAL Methodology, Springer, 2005.
- Tensilica Examples
  - Chris Rowen and Steve Leibson, Engineering the Complex SOC: Fast, Flexible Design with Configurable Processors, Prentice-Hall PTR 2004.
  - Steve Leibson, Designing SOCs with Configured Cores: Unleashing the Tensilica Xtensa and Diamond Cores, Elsevier Morgan Kaufmann 2006.