Virtually any. The picoArray was designed to be a versatile system and can implement any air-interface, including various flavours of CDMA, OFDM or narrow-band. Complete reference designs (passing conformance & system level interoperability) exist for both WCDMA and WiMAX. The architecture could easily support TD-SCDMA, GSM/GPRS/EDGE, 802.11 or others. Researchers are using the power of the picoArray for development activities on 4G and on sophisticated MIMO algorithms.
The picoArray has enough processing power to implement the chip-rate functions, and in some systems that may be an efficient implementation (eg those with legacy code in the symbol rate & control).
However, much of the benefit of the picoArray comes from its ability to address all parts of the system, and provide tight integration between them. This is particularly true for the acceleration of development time: much of the saving comes from improvements in integration & verification from addressing all tasks in one environment.
Yes. Towards the antenna & radio, the picoArray is very well suited to algorithms including Digital PreDistortion (DPD) or ‘Smart RF’, as well as to adaptive antennas, beam-forming etc.
We are also involved in projects on MIMO and MUD (Multi-user detection).
The picoArray is based on general-purpose processors, including both the data-path & control; although optimized for baseband functionality, it can handle higher-layer processing. In most systems, the picoArray implements Layer 1 and 2, before handing off to a MAC device. However, in some systems the MAC has to be tightly coupled to the baseband for performance reasons, and the picoArray can be used as an efficient implementation platform. For example, in WCDMA Release 5, the MAC-HS for HSDPA is a very fast (low-latency) task tightly integrated with the baseband, and it is very well suited to picoArray implementation.
Yes. Each (of eight) 16-bit ADI ports handles up to ~150MSPS, so interfacing to a digital IF is not a problem. For faster rates, or more complex systems, these can be multiplexed or combined (e.g. higher bandwidth I & Q to two separate ADIs). In this way, the picoArray can be used for digital upconversion/downconversion.
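As a rough feel for what these figures imply, the sketch below works out the raw bit rate of one ADI port and of all eight together. It assumes one 16-bit word per sample and takes the approximate 150 MSPS ceiling at face value; the constant names are illustrative and are not picoChip identifiers.

```python
# Raw ADI throughput, assuming one 16-bit word per sample (an assumption).
ADI_PORTS = 8            # ports per device, from the text
BITS_PER_SAMPLE = 16     # port width, from the text
MAX_RATE_SPS = 150e6     # approximate per-port ceiling, from the text

per_port_bps = BITS_PER_SAMPLE * MAX_RATE_SPS
all_ports_bps = ADI_PORTS * per_port_bps

print(f"per port : {per_port_bps / 1e9:.1f} Gbit/s")
print(f"all ports: {all_ports_bps / 1e9:.1f} Gbit/s")
```

Splitting I and Q across two ADIs, as described above, would double the rate available to a single complex stream under the same assumption.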
The power-consumption is very linear with the number of processors active in a system, and how busy they are. Typical dissipation on the PC102 is about 5W.
Whether this requires forced air cooling would depend on the system design (board area, air flow, etc). That said, we have seen designs which do not use forced air.
The Inter picoArray Interface (IPI) is simply the internal bus, extended off chip. There are four ports (which can be either IPI or I/O – Asynchronous Digital Interface, ‘ADI’), so a number of picoArrays can be interconnected.
The IPI is simply the external connection of the internal fabric; logically there is no difference between on-chip communication and communication across different devices. The tool-chain (including the simulator) manages a number of devices exactly as it does one – the designer doesn’t see any difference.
Limited only by timing skew (6ns) across the longest separation of the devices. We have connected 16 devices in a grid, and significantly larger sets would not be a problem.
Interesting question. We believe yes, although we have not implemented this functionality. Our current design uses a PowerPC for network processing.
The PC102 has several interfaces for external connectivity:
Eight ADI interfaces: 16 bits each, up to 150MSPS. These could be used for connectivity to SERDES, ADC/DAC or other framer functionality.
Alternatively, these can be paired to deliver a 32-bit wide Inter picoArray Interface (IPI) for cascading multiple processors.
In addition, there is a processor interface (eg to a host PowerPC microcontroller) which supports four channels of DMA, together with 24 general purpose registers.
We do not explicitly include OBSAI or CPRI specific interface hardware, but the architecture is suited to these standards with the use of external interface devices (e.g. a serdes).
The interconnect fabric, and how it is possible to seamlessly get a number of processors to work together in a coherent way, is the heart of picoChip's IP.
The inter-processor communication protocol is based on a time division multiplexing (TDM) scheme, where data transfers between processor ports occur during time slots and are controlled using the bus switches. Both the bus switch programming and the scheduling of data transfers are fixed at compile time.
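The idea of a slot table that is fixed before the system runs can be illustrated with a toy model. The sketch below allocates time slots on a single shared bus segment; the real fabric spans many segments and switches, and the function name, signal names and frame length here are all hypothetical.

```python
# Toy model of a compile-time TDM schedule on ONE shared bus segment.
# Real picoArray routing spans many segments and switches; this is a
# simplification with hypothetical names.

def build_schedule(signals, frame_len):
    """signals: {name: slots_per_frame}. Returns a list of length
    frame_len mapping each time slot to a signal name (or None)."""
    if sum(signals.values()) > frame_len:
        raise ValueError("requested bandwidth exceeds the frame")
    schedule = [None] * frame_len
    # Place the most demanding signals first, spreading their slots
    # evenly so that high-bandwidth paths also recur frequently.
    for name, count in sorted(signals.items(), key=lambda kv: -kv[1]):
        stride = frame_len / count
        for i in range(count):
            slot = int(i * stride)
            while schedule[slot] is not None:   # probe forward to a free slot
                slot = (slot + 1) % frame_len
            schedule[slot] = name
    return schedule

schedule = build_schedule({"iq_samples": 4, "ctrl": 1, "status": 1}, frame_len=8)
print(schedule)
```

Once such a table exists, run-time operation is just a matter of stepping through it, which is why no arbitration is needed and no two transfers can collide.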
The processing capacity figures given are the net processing calculations for the processors (i.e. co-processor figures are not included). The overhead required for the deterministic inter processor data transfers is managed by the hardware. The peak and average device processing capacity figures for PC102 were calculated as follows.
Peak processing capacity = (Peak of 4 LIW instructions per operation) x (160 MHz clock) x (308 processors per device) = 197.1 GIPS. The four LIW instructions in the peak processing capacity calculation are based upon three execution units (3-way LIW) plus a left or right logical shift included in the operand of the first instruction.
Average processing capacity = (Average of 2 LIW instructions per operation) x (160 MHz clock) x (308 processors per device) = 98.6 GIPS (sustained performance based on observed data from typical complex systems).
Within the peak signal processing capacity of the device, the peak rate for 16-bit multiply accumulate instructions is = (240 STAN processors per device) x (1 multiply-accumulate instruction per operation) x (160 MHz clock) = 38.4 Giga multiply-accumulate (GMAC) instructions per second. In addition, there are (64 MEM + 4 CTRL processors per device) x (1 multiply instruction per operation) x (160 MHz clock) = 10.9 Giga multiply instructions per second.
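The figures above can be reproduced directly from the processor counts and clock rate given in the text. The short script below is only an arithmetic check; the variable names are ours.

```python
# Recompute the PC102 capacity figures from the numbers in the text.
CLOCK_HZ = 160e6
STAN, MEM, CTRL = 240, 64, 4
PROCESSORS = STAN + MEM + CTRL            # 308 processors per device

peak_gips = 4 * CLOCK_HZ * PROCESSORS / 1e9     # 4 instructions per operation
avg_gips = 2 * CLOCK_HZ * PROCESSORS / 1e9      # 2 instructions per operation
gmacs = STAN * CLOCK_HZ / 1e9                   # 1 MAC per STAN per cycle
gmults = (MEM + CTRL) * CLOCK_HZ / 1e9          # 1 multiply per MEM/CTRL per cycle

print(f"peak {peak_gips:.1f} GIPS, average {avg_gips:.1f} GIPS")
print(f"{gmacs:.1f} GMAC/s, {gmults:.1f} Gmultiply/s")
```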
The application specific instructions can deliver significantly higher performance. For example, the complex despread instruction inside the STAN processor (of which there are 240 in the PC102) can replace 40 conventional DSP operations in a single cycle.
Additionally, the co-processor includes configurable hardware for correlation, selection and comparisons which can efficiently accelerate both detection and FEC.
It is very difficult to sustain peak performance in a normal DSP: task-switching, blocking/unblocking data and control all add significant overhead (perhaps 30%). However, we do not have that overhead (all task-allocation is decided at compile time): the 40GMACs quoted is realistic and sustainable, operating in one area, while an additional 100 GOPS are going on simultaneously for other operations.
The total internal memory bandwidth is 322 processors x 2 buses x 32 bits x (160 MHz clock) = 3.3 Tera bits per second.
This is all determined by the toolchain (see questions below): the programmer neither knows nor cares about which processor is performing a specific task, or how they are explicitly connected within the fabric. The programmer specifies the relations between blocks, the signal types and the communications intensity between them; the tools configure the interconnect to deliver that relationship. The developer always works at a level of abstraction of processes and relationships.
To put it another way, the programmer specifies logical connections, and the tools map that into a physical set of bus segments and processors.
The fabric supports TDM, so different bus segments can carry different signals: in effect, the connection matrix changes every cycle to a new routing pattern, with the connections between the processors changing. High bandwidth (or low latency) signal paths are implemented with connections that regularly repeat in many slots.
As such, the engineer can use any:any connections, as well as one:many and many:one (fan-out and fan-in) of signals. This enables any arbitrary topology, as well as supporting one control processor supervising several data-path devices.
A single signal can only connect either one:one or one:many between AEs. Or to put it another way, a signal can have many receivers but only one driver. Fan-in is achieved by using several different input signals to one AE.
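The one-driver rule and the way fan-in is built from several distinct signals can be sketched as a simple check over a connection list. The data layout and all names below are hypothetical; they are not the picoTools netlist format.

```python
from collections import defaultdict

# Hypothetical connection list: (signal, driver AE, receiver AE).
connections = [
    ("samples", "adc_if", "filter_a"),
    ("samples", "adc_if", "filter_b"),   # fan-out: one driver, two receivers
    ("out_a", "filter_a", "combiner"),
    ("out_b", "filter_b", "combiner"),   # fan-in: two signals into one AE
]

def check_single_driver(conns):
    """Return {signal: driver}, raising if any signal has two drivers."""
    drivers = {}
    for signal, driver, _receiver in conns:
        if drivers.setdefault(signal, driver) != driver:
            raise ValueError(f"signal {signal!r} has more than one driver")
    return drivers

def fan_in(conns):
    """Map each receiving AE to the set of distinct signals it reads."""
    inputs = defaultdict(set)
    for signal, _driver, receiver in conns:
        inputs[receiver].add(signal)
    return dict(inputs)

print(check_single_driver(connections))
print(fan_in(connections))
```

Here `combiner` achieves fan-in by reading two separate signals, while `samples` fans out from a single driver to two receivers.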
This is primarily to increase bandwidth and as a result of analysis of the wireless applications. PUT and GET use a register pair (2*16 bits) so the data is readily available. This helps to accommodate I and Q channels at once.
Absolutely not; this is extremely important. The fabric supports any:any communications. This is especially significant for more complex systems, or for tasks that demand more than “a dumb data-pump” – for example control tasks, OAM functions or establishing/monitoring the data-plane.
Unlike some other parallel structures which have only nearest neighbour communications, the picoArray supports an any:any fabric and a rich variety of connections.
picoChip feel that “real world” systems feature complex control-data interactions and need a wide variety of communications links, which are too difficult to represent with a nearest neighbour structure.
A notable aspect of the fabric is that it is strictly deterministic, with communication established at compile-time (not run-time) – analogous to an FPGA. There can be no conflicts. As a result, there is no need for run-time arbitration, and performance is predictable – simplifying debug & test.
No. The tools ensure that any requested configuration/connection will be implemented.
For signal flow (data path), yes. Processors pass signals between themselves, and the system architecture maps to the flow of data.
Control tasks can be implemented in a similar way, with polling of state on a regular basis.
However, this is not mandatory: the system equally supports conventional sequential processing and asynchronous operations. This is primarily used for control tasks, watchdogs, “probes” (non-intrusive debugging points), etc.
The picoArray has 8Mbits of on-chip memory, distributed across the array and local to each processor.
In addition, external SRAM or SDRAM can be used or accessed – eg to store interleaving tables or for large buffers.
Incidentally, memory access supports byte, word & long-word.
Yes. All communication over the 32-bit picoBus is represented by a signal. Data on the signal may be broadcast from one AE to multiple AEs. Whether the signal goes over an IPI or not is not really relevant, as the IPI is transparent within our tools flow.
The power of the picoArray comes from the huge amount of parallel processing capacity and internal communications bandwidth. It is applicable to many challenging signal processing problems, as they tend to fall into a processing paradigm that emphasizes a large amount of processor requirement rather than absolute speed on a single thread. However, the picoArray is not a general purpose processor and therefore may not be suited to certain classes of problems.
There are 308 processors in the PC102. They come in three different “flavours” (STAN, MEM and CTRL).
All have the same basic architecture (16-bit Harvard architecture, 3-way LIW), share a core instruction set but have different memory configurations and some specialist features. All operate at 160MHz.
STAN are very standard DSPs, with Harvard architecture, a built-in MAC and a spread/despread accelerator. Each has 768 bytes of memory, and they are usually used for fast DSP tasks, executing tight code loops. MEM (Memory) has 8.5Kbytes, for either small local control or block-oriented DSP, while CTRL (control) has 64Kbytes for control tasks; MEM and CTRL do not have the MAC.
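As a cross-check, the per-processor memory sizes can be totalled using the processor counts quoted elsewhere in this FAQ (240 STAN, 64 MEM, 4 CTRL). The sketch assumes that "K" means 1024 bytes and that a megabit is 10^6 bits; both are assumptions.

```python
# Per-processor memory from the text; "K" is taken to mean 1024 (assumption).
MEMORY_BYTES = {"STAN": 768, "MEM": int(8.5 * 1024), "CTRL": 64 * 1024}
COUNT = {"STAN": 240, "MEM": 64, "CTRL": 4}

total_bytes = sum(MEMORY_BYTES[kind] * COUNT[kind] for kind in COUNT)
total_mbit = total_bytes * 8 / 1e6   # decimal megabits (assumption)

print(f"{total_bytes} bytes = {total_mbit:.2f} Mbit")
```

Under those assumptions the total comes to roughly 8 Mbit, in line with the on-chip memory figure given for the device.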
The fact that all processors have local memory is very helpful. This memory is shared between program and data, and can be partitioned between the two in different ways.
As well as the basic core, STAN includes a specialist instruction for spread/despread (replaces 50 conventional DSP operations in a single cycle) which can be used in any spread spectrum system.
Because each processor has local memory (both program and data) the effective memory bandwidth is increased dramatically (>3Tbps) compared to a normal single processor architecture (linearly with the number of processors), increasing effective processing speed in many applications.
Furthermore, the close relationship between processor and storage reduces bus-traffic, lowering power.
Finally, the nature of storage localizes algorithm implementation, improving both efficiency & ease of debug/verification.
Yes. If a processor is not used in a design, it is disabled & goes into very-low power mode. If a processor is used, but is not currently active, it goes into sleep mode (for example, if waiting on data).
In general, for most tasks, any processor can be used. In the tools, an engineer can specify a particular processor type, or a ‘generic’ processor type of ‘ANY’ which allows the tool freedom to use any of the cores; perhaps 90% of the code is suitable for ‘ANY’ (which obviously helps utilization).
That said, the processors are optimized:
STAN are obviously oriented towards standard DSP tasks and those requiring multiply-accumulate operations: filters, FFT etc
MEM are very general but are used for tasks requiring more memory, whether local control processors, or for block-signal processing.
CTRL have the largest memory and are typically used for control tasks, whether ones requiring large program storage or large amounts of data (performance tables etc).
The mix of processor types, and the relative number of each, was determined after a lot of research: implementing different algorithm types (WCDMA, OFDM, as well as standard benchmarks & building blocks such as FIR filters, FFTs and the like), and simulating the implementation & efficiency across a wide variety of different processor types & mixes.
The instruction sets were carefully designed & optimized following extensive design & simulation. Specialist support ranges from MAC through to the Spread/Despread set of instructions.
It is (surprisingly) easy. Great effort was put into making it straightforward to program.
Each processor is orthogonal and only interacts through defined means; as such, there is no necessity for the developer to explicitly “manage” or coordinate them – that is all handled through the tools. In many respects this is familiar object-oriented programming: each task is assigned to one, independent, block.
As one major OEM described it: “Building a complex system out of a hierarchy of small blocks is fundamental in engineering. It is far easier to manage a number of simple, independent blocks each running dedicated tasks than to try to explicitly co-ordinate many real-time tasks on a single large, complex processor”
Programming is at two levels.
First, the logical structure & hierarchy of the design is specified. This could be based on a hierarchy diagram, system tools such as Simulink, or modelling frameworks like UML. This structure specifies the names, relationships & connections between independent blocks. At present, the relationships and block diagram structure are specified in structural VHDL. This is not the complex behavioural VHDL and there is no synthesis – it is used purely as a textual net-list.
Secondly, the individual elements are programmed. Programmers can write in ANSI C – we have a full compiler (you can even run “Hello World”) and most complex tasks & control sequences are written this way. However, time critical blocks will tend to be written in assembly, and this is fairly easy: although it is our own instruction-set it will be very obvious and familiar to most engineers.
In essence, you map a process to a processor, and then code that processor. This is then combined in a hierarchy of processes.
The developer doesn’t need to worry about any of those things. The toolchain handles it all: compilation (or assembly) of code for individual processors, as well as the relationships & communications between them.
Even a large, complex design can be fully compiled & placed on the array (with all processors & all communications assigned) in a few minutes.
For development, it doesn’t matter: you write the code, compile it and place it on the array. Since timing is explicit and independent of program execution, there is no need for estimation & sizing exercises to assess performance.
It is very easy to extrapolate from the size of a block and multiply-up to get the required capacity for a complete system. This is a consequence of the independence & orthogonality of blocks: the combination of tasks is a linear combination (there are no ‘side-effects’ and hence no need to allow for ‘overhead’ or ‘margin’).
This predictability is a major advantage of the picoChip development environment.
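The linear sizing argument amounts to a sum of block sizes multiplied by instance counts. The sketch below shows the arithmetic with made-up block names and sizes; it ignores processor-type mix, memory and bus capacity, which a real estimate would also need to consider.

```python
import math

PROCESSORS_PER_DEVICE = 308   # PC102, from the text

# Hypothetical blocks: name -> (processors per instance, number of instances).
blocks = {
    "rake_finger": (6, 32),
    "channel_decoder": (20, 4),
    "control": (3, 1),
}

# With independent blocks the total is a plain linear combination.
total = sum(size * instances for size, instances in blocks.values())
devices = math.ceil(total / PROCESSORS_PER_DEVICE)

print(f"{total} processors -> {devices} device(s)")
```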
There is support for hand-placement, but it has never been necessary: implementation & realisation of a design are all completely automated by the tools.
A source level debugger is implemented within the toolchain, and there is also the simulator.
Very importantly, it is easy to introduce “probes” which can capture signals or data. This is analogous to debug points in non-real-time code but (crucially) differs from usual DSP debugging in that these “probes” are completely non-invasive: they do not load the signals, they do not introduce delays or side-effects.
Secondly, because of “encapsulation” it is easy to create a test harness for a block which exactly replicates its real situation.
The software environment is complete, with all the tools you would expect to find in a mature processor development environment (C compiler, assembler, source-level debugger, simulator, etc).
One thing to stress is that the simulator and debugger are both bit accurate & cycle accurate.
Very. We use standard structural VHDL (‘pico’ just refers to the fact that there is no behavioural element, and so no synthesis etc) to describe the design’s structure, and the relationships between blocks.
It is essentially a textual netlist. It is easy to learn, the only items being entity names, inputs/outputs and relationship lists.
The advantages of using VHDL are that it is well known and that it is well suited to the management of complex designs, with structures, multiple instantiation, strong rules & support for scope etc. Similarly, there is much experience and many design-support tools available to assist in this management.
However, once more, this is structural VHDL, not behavioural – all behavioural aspects of the system are coded in C or assembler.
ANSI / ISO C.
The toolchain currently runs under Red Hat Enterprise Linux 3.0 or 4.0.
The entire picoTools distribution (including documentation) requires about 150 MB of hard disk.
Because each processor is self-contained, and only communicates via explicitly defined signals, with strictly controlled type & timing, they are essentially “encapsulated” from each other. This allows a hierarchy of blocks to be built up. Code written in one block can have no “side effects” for code in another block, as the timing & flow of each is hidden from the other. The only way they interact is through the signals and structures defined by the developer, and these are clearly visible and can be traced or debugged as required.
It is possible to change blocks, to place them in test frameworks etc, with exact simulation of their situation.
The picoArray is a completely deterministic architecture. All decisions on code & interconnect are determined at compile time, not at run time: as such there is no scheduling, arbitration or run-time decisions. This means that the simulator is both bit-accurate & cycle-accurate. It also makes debugging & test much easier: since there are no run-time decisions, the scope of possible interactions or test cases is drastically reduced.
In general, it is not meaningful or necessary.
The role of an RTOS is to share resource (cycles) on a single processor across multiple tasks; in the picoArray, most different tasks are simply assigned to different unique processors. As these are independent, with no resource conflicts, there is simply no need for an RTOS. The removal of this overhead for task-switching improves efficiency & simplifies debugging. (In this respect, picoArray is analogous to an FPGA). In effect, datapath algorithms are “Space-sliced” by spreading them across an array, rather than “time-sliced” on a single processor.
In some situations an RTOS can simplify architecting large control software and we do have a simple RTOS, which can run on the CTRL or MEM processors and provides a familiar programming tool-box. It is a co-operative round-robin scheduler and is used to support multiple tasks in control-plane structures. This is valuable in algorithms or structures which are inherently sequential or time-sliced, and where a parallel structure would be unnecessarily complex or “unnatural”. (This is a good example of how the picoArray can deliver “the best of both worlds” of conventional DSP/RISC or FPGA/ASIC development paradigms: the engineer can select which makes the most sense for a task).
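For readers unfamiliar with the term, a co-operative round-robin scheduler can be sketched in a few lines. This is a generic illustration using Python generators, not the picoChip RTOS or its API; the task names are invented.

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: does one unit of work, then yields control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntarily hand the processor back to the scheduler

def run_round_robin(tasks):
    """Run generator tasks in strict rotation until all have finished."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)
        except StopIteration:
            continue            # task finished; drop it from the rotation
        queue.append(current)   # otherwise requeue it at the back

log = []
run_round_robin([task("oam", 2, log), task("watchdog", 3, log)])
print(log)
```

Because each task gives up the processor only at its own `yield`, there is no pre-emption, and the interleaving is fully determined by the order of tasks and their yield points.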
It is easy to insert “probes” into the system to non-intrusively monitor signals or states. In the same way, watchdog functionality can be added to monitor operations. These could trigger a reset of a particular element or group of processors. However, it is worth stressing that the debugging methodology & deterministic execution make this less of a problem than for traditional architectures.
Two applications where it may be of interest/appropriate are for high-reliability systems (even to the extent of redundant processors & voting), or if there are concerns over soft errors.
Less than ten minutes, often considerably less. Here are some examples:
Example 1. Single PC101, 92% utilized. Less than 5 minutes on a 1.7 GHz Pentium III PC.
Example 2. Five PC101s, 67%, 13%, 49%, 59%, 0.5% utilized (typical of a system that may be used during development). 9 minutes on a dual 1.7 GHz Pentium III PC. The tools make use of the dual processors by running picoPlastic for different picoArrays on separate CPUs.
The time was split as follows:
- 2 mins 25 seconds compilation (C)
- 10 seconds partition (computing signal flow between the separate devices)
- 6 minutes picoPlastic
This is for a full recompilation. Partial recompilations (eg: after adding a probe) take considerably less time. In many cases, only the internal code within a processor is changed, while the connections stay consistent – in this case compilation itself (with no place & switch) is very much faster (of the order of a few seconds).
For FPGAs that provide the same functionality, the place and route time is typically several hours, and usually the engineer needs a detailed knowledge of the EDA tools. Running picoPlastic is always a single button press. No additional knowledge is required.
Yes. The DVK (development kit) supports two picoArrays and is intended for individual users developing code, or small systems; the SDP (System Development Platform) can have up to 16 picoArrays and is intended for larger, more complex projects.
Both include local memory, PowerPC microcontroller and Ethernet interface.
Full development support, from applications engineers through to turn-key development projects.
picoChip was expressly designed & architected for large, complex systems.
The ability to mix different programming styles (block/stream) and to support local control (or hierarchical control) makes development much easier.
The inherent nature of the array enforces decomposition & encapsulation between processes, essentially eliminating ‘side effects’ or run-time dependencies. This is further supported by the use of structural VHDL, which makes it easier to use standard tools for managing a complex design project.
Now. PC102 chips are in volume production.
Because of the architecture, dramatically easier than conventional devices. Integration & Verification are essentially eliminated by the deterministic design. It is possible to exactly predict performance and explicitly design for a worst-case condition, rather than relying on statistical judgement of resource loading & relying on extensive (time-consuming) testing.
Test is radically simpler than conventional architectures: as blocks are independent, testing them in isolation is valid, reducing the test case to SUM(N) rather than FACTORIAL(N) for a conventional system where all blocks can interact, or load each other via run-time dynamic decisions.
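To give a sense of the gap between the two growth rates, the sketch below compares them for a few block counts. It takes one illustrative reading of the claim: one test campaign per independent block versus one per possible ordering of interacting blocks. Both the function name and that interpretation are assumptions.

```python
import math

def growth(n_blocks):
    """Return (isolated, interacting) test-effort counts for n_blocks.

    isolated:    one campaign per independent block, so effort grows as N.
    interacting: one per ordering of blocks that can affect each other,
                 so effort grows as N factorial.
    """
    return n_blocks, math.factorial(n_blocks)

for n in (4, 8, 12):
    isolated, interacting = growth(n)
    print(f"N={n:2d}  isolated={isolated:3d}  interacting={interacting}")
```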
To a degree we offer "the best of both worlds" of DSP & FPGA: programmed in familiar C or ASM, using standard algorithms - but with all the verification & statistical test advantages of hardware.
This is very different from a complex legacy DSP. To quote Jeff Bier of BDTI " [this] is a challenging target for programmers and compilers - the deep complex pipeline is a particularly difficult challenge. Code optimisation is further hampered by the dynamic caches which reduce execution time predictability. Getting tight timings, in a controlled repeatable way, is especially hard when so many things depend on what other code is executing, which is un-knowable in advance".
In contrast, everything in picoArray is predictable in advance and is thus easier to code, optimize or debug.
Yes, both from internal experience & customer feedback. We estimate that developing code on a picoArray is at least 50% faster than on a legacy DSP (due primarily to reduced verification time).
For example, we had our 3GPP compliant Node B code up and running inside four months from first silicon.