- Path: sparky!uunet!munnari.oz.au!yoyo.aarnet.edu.au!sirius.ucs.adelaide.edu.au!augean.eleceng.adelaide.edu.AU!gvokalek
- From: gvokalek@augean.eleceng.adelaide.edu.AU (George Vokalek)
- Newsgroups: comp.dsp
- Subject: Re: Looking for users of TI TMS320C40 DSP chip and/or 3L 'C' compiler
- Message-ID: <1992Jul21.064853.29476@augean.eleceng.adelaide.edu.AU>
- Date: 21 Jul 92 06:48:53 GMT
- References: <5362@mccuts.uts.mcc.ac.uk>
- Organization: Electrical & Electronic Eng., The University of Adelaide
- Lines: 853
-
- It really pisses me off when I type in a reply to a post and the
- email bounces! Therefore I will share my annoyance with the
- world by posting my reply:
-
- ----- Transcript of session follows -----
- While talking to nsfnet-relay.ac.uk:
- >>> DATA
- <<< 554 No date field given
- 554 ehgabp2@uts.mcc.ac.uk... Service unavailable
-
- ------- Received message follows ----
-
- Received: by augean (5.61+IDA+MU/4.8Z)
- id AA29036; Tue, 21 Jul 1992 16:11:19 +0930
- Date: Tue, 21 Jul 1992 16:11:19 +0930
- From: gvokalek (George Vokalek)
- Message-Id: <9207210641.AA29036@augean.eleceng.adelaide.edu.au>
- To: ehgabp2@uts.mcc.ac.uk
- Subject: Re: Looking for users of TI TMS320C40 DSP chip and/or 3L 'C' compiler
- Newsgroups: comp.dsp
- References: <5362@mccuts.uts.mcc.ac.uk>
-
- In comp.dsp you write:
-
-
- >We are looking for users of the Texas Instruments TMS320C40 DSP chip
- >and the 3L 'C' compiler, which produced object code suitable for this
- >chip, hosted on a PC AT.
-
- >Has anyone done any comparisons with the Inmos T800 using a 3L compiler?
-
- While I cannot answer your question directly, I can say that the
- C40 should run rings around the T800, especially in floating point.
- The C40 is more comparable in performance to the mythical T9000.
-
- Our transputer people have just opted for the Inmos C compiler rather
- than the 3L, since we do more embedded type work (rather than big
- hyper-parallel-type-work), and the 3L is 2-3 times the cost of the Inmos.
-
- I have also avoided the 3L compiler for the C40 (we are waiting for
- our C40 hardware to arrive) for similar reasons.
-
- I do have an electronic copy of a paper by 3L discussing its porting
- of the compiler to the C40. That paper is appended at the end of this
- message. (Thanks to S.J.Bradshaw at Traquair Inc. for providing it to me).
-
- Regards,
-
- George.
-
- ******************************************
-
-
- Porting the 3L Parallel C
- Environment to the
- Texas Instruments TMS320C40
-
- Alan D. Culloch, Director
- 3L Ltd., Peel House, Ladywell, Livingston EH54 6AG, Scotland
-
-
- Abstract.
-
- The TMS320C40 ('C40) is a transputer-like parallel processor from Texas
- Instruments. It is an order of magnitude faster than the T800
- transputer. Parallel C is a popular programming environment for the
- transputer. The properties of both the 'C40 and Parallel C are
- described and the significant differences between the 'C40 and the
- transputer are pointed out. The techniques used to overcome these
- obstacles to porting Parallel C are presented. These include building a
- new real-time kernel and reusing existing software packages from
- industry and academia. The suitability of the 'C40 for parallel
- applications is discussed.
-
-
-
-
- 1. The TMS320C40 Chip
-
- The TMS320C40 ('C40) is the latest member of Texas Instruments' TMS320 family
- of Digital Signal Processing (DSP) chips. It is the first processor
- from a major US semiconductor vendor to feature transputer-like built-in
- communications links for parallel processing. In addition, it gives
- state-of-the-art single processor performance equalling or exceeding existing
- ``superchips'' such as the i860. The device is fully described in [1].
-
- The performance of the 'C40 is certainly impressive: 275 million
- operations per second and 50 MFLOPS. Six 160Mbit/s communications links
- give each node in a network a total communications throughput of 120Mbyte/s.
-
- The parallel processing features of the 'C40 were added in response to
- feedback from customers. They indicated that large parallel DSP
- and real-time systems were already being built using the previous
- generation of standalone chips (the 'C30). What was wanted was a way to
- simplify the design and implementation of such systems. This was
- achieved by adding fast communications links to the processor. These
- make it easy to construct large MIMD parallel systems with no external
- communications support logic.
-
- 2. Availability
-
- Sample 'C40 silicon has been available since the formal launch of the
- chip in June 1991. Board-level hardware became available from European
- vendors such as Hema (Germany) and Hunt Engineering (UK) by the end of
- 1991.
-
- We now look in turn at each of the features of the 'C40 design which are
- most significant for parallel systems development.
-
- 3. Memory Architecture
-
- 'C40 memory is always organised in 32-bit words. Byte addressing is not
- supported. I.e., the first word in memory is at address 0, the next
- word is address 1, and so on. Integer, 32-bit floating point and
- character data items all occupy one word each (although it is possible
- to pack characters into integers "by hand" by shifting and masking.) In
- C terms, the sizeof all atomic types is 1.
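-
- As a minimal sketch of the "by hand" packing mentioned above, the
- following C fragment packs four 8-bit characters into one 32-bit word
- and extracts them again by shifting and masking. The function names are
- illustrative assumptions, not part of any TI or 3L library, and
- uint32_t stands in for the 'C40's native 32-bit word.
-
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack four 8-bit characters into one 32-bit word.
   Character 0 occupies the least significant 8 bits. */
static uint32_t pack4(const char c[4])
{
    uint32_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= ((uint32_t)(unsigned char)c[i] & 0xFFu) << (8 * i);
    return w;
}

/* Hypothetical helper: extract character number i (0..3) from a word. */
static char unpack_char(uint32_t w, int i)
{
    assert(i >= 0 && i < 4);
    return (char)((w >> (8 * i)) & 0xFFu);
}
```
-
- Packing this way trades memory for time: four characters share one
- word, but every access costs a shift and a mask.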
-
- 3.0 Data Formats
-
- Floating-point data are held in a proprietary 32-bit format.
- Instructions are provided to convert to and from the IEEE standard
- 32-bit format. Intermediate results can be held in the processor's
- internal registers in an extended-precision 40-bit format.
-
- 3.1 Memory Buses
-
- The 'C40 has two external memory buses, called the local bus and the
- global bus. The first 2Gword half of the 4Gword (16Gbyte) 32-bit
- address space accesses the local bus; the other half accesses the global
- bus. Having two external memory interfaces provides greater total
- CPU/memory bandwidth and increases the available parallelism.
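-
- The split of the address space can be sketched in C as follows. This is
- only an illustration of the arithmetic in the paragraph above: the type
- and function names are assumptions made for the example.
-
```c
#include <stdint.h>

/* A 'C40 address is a 32-bit word address, so the address space holds
   2^32 words. The lower half maps to the local bus, the upper half to
   the global bus, so the top address bit selects the bus. */
typedef enum { BUS_LOCAL, BUS_GLOBAL } bus_t;

static bus_t bus_for_address(uint32_t word_addr)
{
    return (word_addr & 0x80000000u) ? BUS_GLOBAL : BUS_LOCAL;
}

/* Size in bytes of a region of n 32-bit words (4 bytes per word). */
static uint64_t words_to_bytes(uint64_t n_words)
{
    return n_words * 4u;
}
```
-
- For example, words_to_bytes applied to the full 4Gword space gives the
- 16Gbyte figure quoted above.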
-
- The first 3Mword of the local bus region is reserved for peripherals
- (including internal peripherals like the on-chip timers and DMA engines)
- and two blocks of internal RAM. Depending on the setting of an external
- pin, the first 1Mword, from 0-1M, is mapped either to the local bus or
- an on-chip ROM containing built-in bootstrap code.
-
- 3.2 Internal RAM
-
- The 'C40 has two blocks of on-chip RAM. Each block is 1Kword (4Kbytes)
- long. The on-chip RAM can be used for code or data: it appears at the
- end of the 3Mword of reserved space below the local bus region, and is
- contiguous with the start of local bus memory.
-
- 3.3 Parallel Memory Access
-
- Each different block of memory (internal ROM and RAM blocks 0 and 1,
- the external memory buses) can support two concurrent accesses, allowing
- a considerable amount of micro-parallelism in the execution of
- sequential code. For example: the CPU can access two data values in one
- RAM block and perform an external program fetch in parallel with one of
- the DMA engines loading another RAM block, all within a single cycle.
-
- 3.4 Cache Memory
-
- To minimise fetching of program code from relatively slow external memory the
- 'C40 contains 512 bytes of on-chip cache for instructions. This is in
- addition to, and operates independently of, the two blocks of
- general-purpose on-chip RAM.
-
- 4. CPU Architecture
-
- 4.1 Register File
-
- The CPU has 32 registers. Twelve of these are general-purpose registers
- which can hold either word data or floating-point data in the
- processor's extended-precision format (40 bits wide.) Another eight
- general registers can hold word data only. The remaining registers have
- dedicated uses, as stack pointers, index registers, loop counters and so
- on.
-
- 4.2 Instruction Set and Pipelining
-
- 'C40 instructions have a three-address format specifying two source
- operands and a destination location. Various operand addressing modes
- are provided (register, immediate, direct, indirect.)
- However, addressing is not completely general: instructions generally
- restrict the allowed addressing modes for their operands and results in
- various ways, for example, by requiring that the result of an operation
- be stored into one of the twelve extended-precision registers.
-
- All the usual operations on integer and floating-point data are provided.
- One interesting group of instructions provides for loop control without
- the overhead of embedding compare and branch instructions in the
- instruction stream being executed. The processor contains some
- special-purpose registers for loop counting and holding pointers to the
- start and end of a block of code to be executed repeatedly. In its
- special "repeat mode" the processor implicitly moves the PC back to the start of
- the loop when it steps over the last instruction of the loop block.
- Inner loops can therefore dispense with the overhead of counting,
- comparing and branching at the macroinstruction level.
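-
- The effect of repeat mode can be illustrated with a toy software model.
- This is a sketch under simplifying assumptions, not a description of the
- real register semantics (see [1] for those): the structure and field
- names are invented for the example, and each instruction is assumed to
- occupy one word.
-
```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a repeat-block mechanism. Field names are illustrative. */
typedef struct {
    uint32_t pc;         /* program counter */
    uint32_t start;      /* address of first instruction in the block */
    uint32_t end;        /* address of last instruction in the block */
    uint32_t remaining;  /* passes still to run after the current one */
    int      repeat_mode;
} cpu_t;

/* Advance the PC past one instruction. In repeat mode, stepping over the
   last instruction of the block moves the PC back to the start, with no
   compare or branch instruction in the program itself. */
static void step(cpu_t *cpu)
{
    if (cpu->repeat_mode && cpu->pc == cpu->end) {
        if (cpu->remaining > 0) {
            cpu->remaining--;
            cpu->pc = cpu->start;
            return;
        }
        cpu->repeat_mode = 0;   /* loop finished: fall through */
    }
    cpu->pc++;
}

/* Count the instructions executed when a block runs `passes` times. */
static unsigned run_block(uint32_t start, uint32_t end, uint32_t passes)
{
    assert(passes >= 1 && start <= end);
    cpu_t cpu = { start, start, end, passes - 1, 1 };
    unsigned executed = 0;
    while (cpu.pc <= end) {
        executed++;             /* "execute" the instruction at pc */
        step(&cpu);
    }
    return executed;
}
```
-
- In this model a three-instruction block run three times executes
- exactly nine instructions: none are spent on loop control.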
-
- As is now the norm for RISC designs, delayed branches allow overlapping
- of useful instructions with the fetching of out-of-sequence
- instructions. The 'C40 in fact goes a lot further down this path to
- increased performance, by the provision of quite a large number of
- "parallel" instructions. These allow particular pairs of operations
- to be overlapped in a single instruction cycle, for example
- multiplication of two floating-point values can be performed in parallel
- with storing another floating-point value in memory. The source and
- destination locations for both instructions must be encoded into a
- single instruction word, and are therefore more restricted in the
- addressing modes allowed than other types of instruction.
-
- 5. Inter-Processor Communication
-
- 5.1 Communications Ports
-
- The 'C40 has six built-in communications ports. Each can handle
- bi-directional communication at 20Mbyte/s, giving a total maximum
- throughput of 120Mbyte/s. Ports are connected to each other by a
- 12-wire parallel interface (8 data bits plus 4 control lines.)
-
- Although the links are bi-directional, they are not full duplex. Data
- may only flow across the link in one direction at a time. However,
- "turnaround" of the direction of the traffic is automatic when the link
- is accessed. Ownership of the link is managed by a hardware
- token-passing scheme which uses some of the four control lines.
-
- On reset, three of the links go into output mode; the other three are
- forced to input mode. A board designer must ensure that in the reset
- state outputs are connected only to inputs, and vice versa. Failure to
- do so can destroy the processor. Thankfully this problem is not visible
- to software.
-
- 'C40 links are buffered. Both the input and the output side of each link
- have an 8-word hardware FIFO queue attached. Therefore there is
- effectively 16 words of buffering on a full communication. Writing to a
- full FIFO, or reading from an empty one, blocks the CPU until the
- condition is cleared.
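-
- The behaviour of one 8-word FIFO can be modelled in C as below. This is
- a software sketch only: the names are assumptions, and where the real
- hardware would stall the CPU, the model returns 0 so that the caller can
- see the "would block" condition.
-
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_WORDS 8   /* each side of a link buffers 8 words */

typedef struct {
    uint32_t data[FIFO_WORDS];
    unsigned head;     /* index of the oldest queued word */
    unsigned count;    /* number of words currently queued */
} fifo_t;

/* Non-blocking put: returns 1 on success, 0 if the FIFO is full.
   On the real hardware a write to a full FIFO stalls the CPU instead. */
static int fifo_put(fifo_t *f, uint32_t w)
{
    assert(f != NULL);
    if (f->count == FIFO_WORDS)
        return 0;
    f->data[(f->head + f->count) % FIFO_WORDS] = w;
    f->count++;
    return 1;
}

/* Non-blocking get: returns 1 and stores the word, or 0 if empty. */
static int fifo_get(fifo_t *f, uint32_t *w)
{
    assert(f != NULL && w != NULL);
    if (f->count == 0)
        return 0;
    *w = f->data[f->head];
    f->head = (f->head + 1) % FIFO_WORDS;
    f->count--;
    return 1;
}
```
-
- Chaining an output FIFO to an input FIFO in this model gives the 16
- words of buffering per communication described above.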
-
- Reading and writing the links is straightforward. The communication
- ports appear as mapped devices in the 'C40 address space. Sending a
- word over an output link is simply a matter of writing a word to the
- address of the mapped device.
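-
- In C, such a write might look like the sketch below. The port addresses
- are not given in this paper, so the functions take the address as a
- parameter rather than assuming one; the function names are hypothetical.
- On a host machine the pointer can simply refer to an ordinary variable.
-
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Send one word over a link by storing it to the port's mapped address.
   `volatile` stops the compiler from removing or merging the stores. */
static void link_send_word(volatile uint32_t *port, uint32_t word)
{
    assert(port != NULL);
    *port = word;
}

/* Send a block of words: one store per word, all to the same address. */
static void link_send_block(volatile uint32_t *port,
                            const uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        link_send_word(port, buf[i]);
}
```
-
- Note that on the real device each store may stall the CPU if the output
- FIFO is full, as described above.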
-
- 5.2 The DMA Coprocessor
-
- The 'C40 provides an on-chip DMA engine to relieve the CPU of the
- need to intervene in inter-processor data transfer. The DMA engine
- has its own separate internal buses connecting it to memory and the
- communications ports. It can therefore transfer data without
- interfering with the operation of the CPU.
-
- The DMA engine is itself a quite sophisticated programmable device.
- It is driven by control blocks in memory which can be linked together to
- form DMA "programs". Each block specifies the source, destination and
- size of an individual transfer. Indexing allows non-contiguous data to
- be transferred (e.g., every 10th word from an array.) Bit-reversed
- addressing is provided for FFT applications.
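-
- For readers unfamiliar with bit-reversed ordering, the index mapping
- itself can be written in a few lines of C. This is a software
- illustration of the ordering only; it does not claim to show how the DMA
- hardware computes its addresses, and the function name is an assumption.
-
```c
#include <assert.h>
#include <stdint.h>

/* Reverse the low `bits` bits of index i (bits in 1..32). For an FFT of
   size N = 2^bits, element i pairs with element bit_reverse(i, bits). */
static uint32_t bit_reverse(uint32_t i, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    uint32_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i into r */
        i >>= 1;
    }
    return r;
}
```
-
- For an 8-point transform (bits = 3), index 1 (001) maps to 4 (100) and
- index 3 (011) maps to 6 (110).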
-
- The basic operation of the DMA engine is fully general: it can move
- words from any part of memory to any other. After each word is
- transferred, the next is addressed by adding an index value from the DMA
- control block to the previous address. This, combined with the fact
- that the communications ports are mapped as memory locations, allows the
- DMA engine to transfer a block of words to or from a fixed location (such as a
- communications port) by specifying a zero "index" value. A further
- benefit of this approach is that data can be routed directly through a 'C40 by
- the DMA engine from one communication port to another, without any CPU
- intervention at all.
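-
- The control-block idea can be sketched as a small software model. The
- structure layout and names below are hypothetical (the real layout is
- documented in [1]); the sketch only shows how linked blocks, an index
- per transfer, and a zero index for a fixed location fit together.
-
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block; not the real 'C40 register layout. */
typedef struct dma_block {
    const uint32_t *src;      /* first source word */
    ptrdiff_t src_index;      /* step after each word (0 = fixed address) */
    uint32_t *dst;            /* first destination word */
    ptrdiff_t dst_index;      /* step after each word (0 = fixed address) */
    size_t count;             /* number of words to move */
    struct dma_block *next;   /* next block in the DMA "program", or NULL */
} dma_block;

/* Software model of the engine running a chain of control blocks. */
static void dma_run(const dma_block *b)
{
    for (; b != NULL; b = b->next) {
        for (size_t i = 0; i < b->count; i++) {
            ptrdiff_t k = (ptrdiff_t)i;
            b->dst[k * b->dst_index] = b->src[k * b->src_index];
        }
    }
}
```
-
- A block with src_index 10 and dst_index 1 gathers every 10th word of an
- array; a block with dst_index 0 writes every word to one fixed address,
- which is how a mapped communications port would be fed.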
-
- 5.3 Timers
-
- The 'C40 has two on-chip timer/event counter units which can be used to
- generate interrupts and support timeouts on link communications.
-
-
- 6. 3L Parallel C
-
- Parallel C was originally developed by 3L for the transputer. To allow
- porting of applications, a 'C40 implementation of Parallel C would have
- to support the same user interface as the transputer implementation.
-
- The main functions provided by the transputer version of Parallel C
- which would have to be reproduced in a 'C40 port are outlined here.
- Parallel C is more fully described in [2] and [3].
-
- In addition to a conventional C compiler, linker and run-time
- library for the target processor, Parallel C provides the following.
-
- 6.1 Extended Run-Time Library
-
- The C run-time library is extended with data types representing
- transputer channels and with functions to: create new threads of
- execution (processes); pass messages between tasks in a synchronised
- CSP-style fashion, with optional timeout handling; wait/signal on
- semaphores; and perform a CSP-style "alternative wait" operation. This
- last blocks the calling task until one of a number of input channels
- becomes ready to communicate.
-
- In the transputer implementation the message-passing functions can be
- implemented by in-line code accessing built-in hardware facilities.
-
- 6.2 Configurer
-
- In Parallel C, complete systems are built by composing independent
- software tasks. Each main task which is to execute in parallel is an
- independently compiled and linked C program. The tool which binds these
- linked task images together with various bits of bootstrap and loader
- software to produce a bootable multi-tasking application image file is
- called the configurer. It is driven by a user-written text file which
- describes the required hardware configuration, specifies the files
- containing the software tasks, and indicates which tasks are to be
- placed onto which processors.
-
- Replicated tasks placed on many different processors are held once in
- the application image file and copied from processor to processor
- dynamically at load time.
-
- 6.3 Flood Configurer
-
- This variant configurer takes two special kinds of task, the master and
- worker tasks of a processor farm application, and combines them with
- standard bootstrap, loader and message-routing software to make a
- bootable application image which, when loaded, spreads out and "floods"
- every accessible processor in the network with a copy of the same worker
- task. Work packets are distributed from the master to the workers and
- result packets are carried back through the network by the routing
- software. Packet broadcasts from the master to all workers are
- supported for initial setup of data to be worked on in parallel.
-
-
- 7. Rationale for Porting Parallel C
-
- From the description of the 'C40 processor architecture given previously
- it should be clear that it fits the original sense of the term
- "transputer", in that it is a programmable device with its own integral
- memory and inter-processor communications links. It is therefore
- sensible to consider porting parallel software between these platforms.
-
- The interesting differences between Inmos' and Texas Instruments'
- implementations of this concept are dealt with in a later section. Here,
- we just note that the similarities are sufficient to make it appear
- useful and sensible to make existing transputer software available on this
- new platform, and that this includes development systems such as
- Parallel C.
-
- This would be so even if it were not the case that the 'C40 is an order
- of magnitude faster than the currently-available T800 transputers, in
- both raw processing speed and in communications bandwidth. However,
- because there is such a large performance benefit, we expect the take-up
- of this new technology by current transputer users to be quicker than
- it would otherwise be.
-
- The other main reason for undertaking a port was the popularity of Texas
- Instruments' previous 'C30 product in DSP work. The familiarity of DSP
- engineers with this product family should lead to fast take-up of the
- parallel processing 'C40 and a consequent need for parallel systems
- development software such as Parallel C.
-
-
- 8. Technical Issues Involved in a Port
-
- The main technical issues of interest in the porting process are to do with the
- architectural differences between the transputer and the 'C40.
-
- 8.1 Communications Links
-
- Transputer links differ from 'C40 ones in both the hardware and its
- appearance to the software. The 12-wire links of the 'C40 give greater
- speed (20Mbyte/s vs. 20Mbit/s) but require more pins and board space.
-
- 'C40 links appear as memory locations to be read or written a word at a
- time under program control (or under control of the DMA coprocessor.)
- There are six of them rather than four, so a wider range of fixed link
- topologies can be built directly, including 3D grids and larger
- hypercubes.
-
- More importantly, the fact that the 'C40 links are hardware buffered
- means that they cannot be used directly to emulate transputer-style
- unbuffered, synchronised message passing. Consider a short message of
- two words sent from processor A to processor B. If the receiving
- process on B is not ready to receive the message, then in the transputer
- case the sending process will be blocked until the receiver becomes
- ready. If 'C40 links are used "raw", then either the message will be
- sent and A will proceed asynchronously, or the FIFOs will be full up,
- and the whole of processor A (not just the sending process) will be
- blocked.
-
- This problem implies that a 'C40 implementation of Parallel C
- will require some low-level software to control the hardware links. All
- transputer-style synchronised message traffic on a link must be under the
- control of this software. Note that some or all links may be used "raw"
- to avoid the software overhead involved here, but that communications
- over those links would not have transputer-like synchronisation
- properties and would have to be carefully managed under user control.
-
- Ensuring synchronisation of sender and receiver requires the link
- control software at both ends of a communication path to exchange
- acknowledgements of receipt using some protocol over and above the user
- data being transmitted. The question of what protocol is used is
- addressed later.
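-
- To make the idea of an acknowledgement concrete, here is a deliberately
- tiny model in C. It is not the protocol actually adopted (that is the
- subject of a later section); every name is hypothetical, and the link is
- reduced to a single word of buffering in one direction plus a return
- path for the acknowledgement.
-
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a buffered link with an acknowledgement path. */
typedef struct {
    uint32_t data;     /* word in flight (one word of buffering) */
    int data_ready;    /* set by the sender, cleared by the receiver */
    int ack_ready;     /* set by the receiver, cleared by the sender */
} sync_link;

/* Sender, step 1: hand the word to the link. The raw hardware would let
   the sending process run on at this point; the protocol must not. */
static void sync_send_start(sync_link *l, uint32_t w)
{
    assert(l != NULL);
    l->data = w;
    l->data_ready = 1;
}

/* Sender, step 2: may the sending process resume? Only after the
   receiver has acknowledged receipt. Returns 1 if so, else 0. */
static int sync_send_complete(sync_link *l)
{
    if (!l->ack_ready)
        return 0;
    l->ack_ready = 0;
    return 1;
}

/* Receiver: take the word if present and post an acknowledgement.
   Returns 1 if a word was received, else 0. */
static int sync_receive(sync_link *l, uint32_t *w)
{
    if (!l->data_ready)
        return 0;
    *w = l->data;
    l->data_ready = 0;
    l->ack_ready = 1;
    return 1;
}
```
-
- The sending process stays descheduled between sync_send_start and a
- successful sync_send_complete, which restores the transputer-style
- rendezvous even though the word itself left the sender earlier.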
-
- Although 'C40 links are hardware FIFO-buffered, when a buffer fills
- up, the whole processor (not just the sending or receiving process) is
- blocked. User code must either poll "ready" flags to avoid the
- situation, or make use of interrupt signals which indicate when the
- buffers become ready. These interrupts can be used in combination with
- the DMA engine. To avoid one thread blocking the whole processor, the
- handling of the links and these interrupts must be centralised in
- the low-level support software.
-
- Software is also required to handle the discrepancy between
- the number of possible concurrently active communications on the links
- (12, one in each direction on six links) and the six available DMA
- channels. This requires dynamic software management of free DMA
- channels.
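-
- One simple way to manage free DMA channels is a bitmask allocator, as
- sketched below. This is an assumption about how such management could
- be done, not a description of 3L's code; the names are invented.
-
```c
#include <assert.h>
#include <stdint.h>

#define DMA_CHANNELS 6

/* Bit i set means channel i is currently in use. */
typedef struct { uint8_t in_use; } dma_pool;

/* Claim a free channel. Returns its number, or -1 if all six are busy,
   in which case the caller must queue the transfer until one is freed. */
static int dma_alloc(dma_pool *p)
{
    for (int ch = 0; ch < DMA_CHANNELS; ch++) {
        if (!(p->in_use & (1u << ch))) {
            p->in_use |= (uint8_t)(1u << ch);
            return ch;
        }
    }
    return -1;
}

/* Return a channel to the pool once its transfer has completed. */
static void dma_free(dma_pool *p, int ch)
{
    assert(ch >= 0 && ch < DMA_CHANNELS);
    p->in_use &= (uint8_t)~(1u << ch);
}
```
-
- With up to 12 transfers wanting service, a -1 result from dma_alloc is
- the normal case that the queueing logic has to handle, not an error.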
-
- 8.2 Processes
-
- Unlike the transputer, the 'C40 has no built-in firmware to manage
- process creation, timeslicing or message synchronisation: these
- functions must be performed by software: another job for the low-level
- software.
-
- Note that the large number of CPU registers in the 'C40 puts a premium on the
- run-time efficiency of such a software context-switching mechanism.
-
- There is some built-in software on the 'C40 held in ROM: this is a
- bootstrap program which allows networks of 'C40s to be booted from their
- links.
-
- 8.3 CPU Architecture
-
- A number of software changes are obviously implied by the differences in
- internal architecture between the two processors.
-
- The register vs. stack-based assembly programming model, the very
- different instruction set, the non-IEEE floating-point format, and the
- lack of byte addressability obviously all conspire to require a
- completely different compiler.
-
-
- 9. Overcoming the Differences
-
- In deciding how to go about the port and overcome the technical
- difficulties mentioned above, three goals were paramount:
-
- 1. A robust product was required.
- 2. It should be implemented as quickly as possible.
- 3. As far as possible the resulting product should retain the
- application programmer interface of the transputer version.
-
- These goals had two main implications for our design:
-
- 1. Reuse as much existing, tested software as possible.
- 2. A number of transputer features would need to be emulated in software.
-
- In the rest of this section, we look at the different software
- components of Parallel C, and how they have changed from the transputer
- version of the software.
-
-
- 9.1 The Compiler and Linker
-
- Two options were considered for the compiler: first, use TI's own;
- second, re-jig the existing 3L transputer C compiler by replacing its
- machine-dependent code generator module with a 'C40 version.
-
- This was a relatively easy choice given the previously stated goals.
- Even although the code generator in the existing compiler is quite well
- separated from the rest of the program, using the intermediate-code
- technology described in [4], and the team had considerable experience
- with such work, it is still quite large (15-20K source lines.) Writing a
- new code generator would consume a disproportionate fraction of the
- resources available to the project. Since TI's compiler was already in
- an advanced state of development at the beginning of the project, and
- was an evolution from the existing code-compatible 'C30 compiler, we
- decided to use it as the basis of Parallel C. This met our goal of
- minimising development time, and by gaining access to well-tried code
- should prove to be robust as well.
-
- Subsidiary benefits of this route which also weighed in the decision
- were the resulting automatic compatibility with TI's assemblers,
- libraries, and any third-party libraries written for that compiler.
-
- 9.2 The C Run-Time Library and Host Server
-
- Because it was originally designed with standalone DSP applications in
- mind rather than more general parallel processing work, TI's C compiler is a
- "freestanding" implementation, in the sense of the ANSI standard. I.e.,
- its library does not provide standard I/O and other similar high-level
- services. 3L has therefore had to port those parts of its existing C
- run-time library to the 'C40, in order to provide the full run-time
- environment familiar to users of Parallel C.
-
- As on the transputer, the standard I/O services in the run-time library are
- provided by means of remote procedure calls from library code on the root
- node of the 'C40 network across a link to a host computer. The protocol
- used for communication with the host is a variant of the existing
- afserver protocol, again more for speed of implementation of both library
- and host server than for elegance (the protocol has a number of
- well-known drawbacks.)
-
- In addition to providing a full sequential C run-time library, functions
- also had to be added to support Parallel C's specialised thread
- creation, message passing, semaphore, and timer calls. On the 'C40 most
- of these operations are implemented by extracodes which trap to the
- software kernel described in the next section, rather than being
- performed in-line as on the transputer.
-
- 9.3 The Kernel
-
- As described previously, the lack of built-in support on the 'C40 for
- processes and timeslicing, the partially asynchronous nature of its
- links requiring extra protocol to ensure synchronisation, and the
- dynamic management of DMA channels, all require the presence of a small
- software kernel on every 'C40 network node running Parallel C. This is
- the main difference from the transputer implementation, which relies
- entirely on the processor's firmware kernel, except for network loading
- software and the packet routers in a flood-configured application.
-
- Implementing a kernel to manage prioritised process queues, device
- interrupts and timers is classic computer science textbook material, but
- there are a number of complicating factors. The first is the necessity
- to minimise the software overhead involved in context switching between
- processes. Second, memory management must be able to cope with
- essentially any number of threads being created at run time by the
- client tasks.
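-
- The textbook structure referred to here can be sketched in C, although
- the kernel itself was written in assembly language. The sketch assumes
- two priority levels and uses invented names throughout. Because each
- process record carries its own queue link, the queues need no
- fixed-size table and so place no limit on the number of threads.
-
```c
#include <assert.h>
#include <stddef.h>

#define PRIORITIES 2   /* assumption: 0 = urgent, 1 = normal */

typedef struct process {
    struct process *next;   /* link in a ready queue */
    int id;
} process;

typedef struct {
    process *head[PRIORITIES];
    process *tail[PRIORITIES];
} ready_queues;

/* Append a process to the FIFO queue for its priority level. */
static void make_ready(ready_queues *q, process *p, int prio)
{
    assert(prio >= 0 && prio < PRIORITIES);
    p->next = NULL;
    if (q->tail[prio])
        q->tail[prio]->next = p;
    else
        q->head[prio] = p;
    q->tail[prio] = p;
}

/* Pick the next process to run: the oldest entry at the most urgent
   non-empty level, or NULL if nothing is ready. */
static process *schedule(ready_queues *q)
{
    for (int prio = 0; prio < PRIORITIES; prio++) {
        process *p = q->head[prio];
        if (p) {
            q->head[prio] = p->next;
            if (!q->head[prio])
                q->tail[prio] = NULL;
            return p;
        }
    }
    return NULL;
}
```
-
- Both operations take constant time, which matters given the emphasis
- on keeping context-switch overhead low.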
-
- Given that the kernel was to be written from scratch, the choice of
- coding language was open. In the end, both because of the requirement
- to minimise context switch times and because a lot of the code in the
- kernel was in any case to do with intimate manipulation of the internal
- state of the processor, assembly language was chosen.
-
- The kernel provides services to client user tasks at the level of "send this
- message to that port", "wait for timer", "create new process".
-
- Note that the message operations provided by the kernel are close to the
- hardware. It is responsible for allocating and setting up DMA
- channels, initiating transfers, and notifying clients of completion. It
- keeps track of all device interrupts.
-
- However, in our design the kernel in the end was not made responsible
- for message synchronisation. This job is done instead by a higher-level
- software package; this software is described in the next section, along
- with our reasons for separating it from the kernel proper.
-
- 9.4 Message Synchronisation: VCR and UPR
-
- For some time before the idea of a port of Parallel C to the 'C40 had
- appeared, we had been surveying various of the general-purpose message
- routing packages which had become available for the transputer, partly
- as a result of 3L's involvement in the SERC/DTI Transputer Initiative's
- working groups on parallel systems standardisation.
-
- The packages under review included commercial products such as Express
- (Parasoft) and FNODE (Sang), academic work like TINY (Edinburgh
- University) and VCR/UPR (Southampton University), and numerous bits of
- ad-hoc software written by individual users of Parallel C. In-house
- ideas were also being explored.
-
- By the time design of the 'C40 port was under way, VCR/UPR [5],
- developed at Southampton University under the auspices of the European
- Community's PUMA research programme, had become our "favourite" package,
- and detailed in-house evaluation of it was proceeding.
-
- When the question of how we should handle sender/receiver
- synchronisation and support for transputer-style ALTs over the basically
- asynchronous 'C40 link hardware arose, a ready-made solution was
- therefore at hand. VCR (Virtual Channel Router) provides exactly the
- facilities we required (synchronised message send and receive, and ALT
- with transputer semantics too) and is layered on top of UPR (Universal
- Packet Router), a low-level packet switcher which makes no assumption
- that the hardware link layer it runs on has transputer semantics.
-
- In the light of our aim of combining speedy development with a robust
- end product, the course was clear: we should incorporate VCR/UPR into
- Parallel C for the 'C40 as the "link-control software" discussed
- previously. The job of writing a non-deadlocking link protocol handler
- would be simplified to adapting VCR/UPR to our software environment.
-
- Using VCR/UPR (written in C) had the additional advantage of simplifying
- the job of our (assembler) kernel, making it easier to construct and
- more robust. All the kernel had to do now was provide simplified system
- services which could be called from UPR to access the raw link hardware.
-
- Relatively few modifications have had to be made to VCR/UPR itself to adapt the
- package to the 3L software environment. The primary ones have been to
- provide emulations of the transputer process-control libraries used by
- the original code, and to allow for the possibility that some of the
- available links on a node might be dedicated by the configurer to "raw"
- (non-virtual) traffic, not under the control of VCR/UPR.
-
- VCR/UPR is tightly coded and adds minimal overhead to message transit
- times. Nevertheless, easy direct access to the underlying hardware
- without unnecessary overhead has always been a strength of Parallel C.
- A function library providing access to "raw" (unsynchronised,
- non-virtual) 'C40 links is therefore to be provided. This will be of
- particular benefit to pipelined DSP applications, allowing peak link
- performance to be reached in critical inner loops. The user retains
- control of the tradeoff between the convenience, well-understood
- semantics and compatibility of synchronised transputer-style
- communication, and the additional performance which going down to the
- native hardware mechanisms can bring.
-
- Since at present Parallel C only supports nearest-neighbour
- communications, those parts of VCR/UPR which handle message routing
- between nodes which are not direct neighbours have been eliminated from
- the 'C40 version. This simplified the porting process and also
- minimised the amount of code present on each node. However, it should
- be clear that adoption of the VCR/UPR framework provides considerable
- scope for future enhancement of Parallel C in this area of general
- message routing, both on the 'C40 and the transputer.
-
- 9.5 The Configurers
-
- Three main issues arose with the configurers.
-
- The first concerned the machine on which the configurer should run. The
- transputer implementations run on a transputer board, using a host file
- server for I/O just like normal user programs. However, the compiler
- and linker from TI run as 80x86 programs under DOS, not as 'C40 code
- communicating with a file server.
-
- Both approaches have their own advantages and disadvantages: the
- server-based approach makes porting to different hosts easy; running
- directly on the host can sometimes be faster because file I/O does not
- have to be performed remotely. What was certain however, was that a
- mixture would only have the worst of both worlds. Since the TI compiler
- and linker were not server-based, the choice was made for us: we would
- port the configurers from the transputer to the 80x86.
-
- In fact, even although the C code had originally been written for a
- 32-bit machine with a flat address space (the transputer), it proved to
- be sufficiently well structured that a 16-bit version for the 286/386
- was produced quite swiftly.
-
- The second configurer issue arose from the nature of UPR, and the
- environment its original designers had assumed. UPR is driven by
- externally-generated routing tables. In the Southampton implementation,
- these tables were calculated "once and for all" for each hardware
- topology of interest. In the 3L implementation, however, the user's
- configuration file can specify, for performance reasons, that
- particular hardware links should be dedicated to "raw", non-virtual
- traffic. Such links are unavailable to UPR for routing messages, so
- the routing tables could not be set up before configuration time.
-
- Evidently, UPR routing table generation would have to be deferred until
- configuration time. Not only did this require splicing code from the
- table-generation programs into the heart of the configurer, it also
- raised the issue of the speed and memory requirements of the routing
- table generation algorithms. In the event, some judicious tuning
- sufficed to make routing table generation during configuration feasible.
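-
- The idea of generating tables at configuration time can be sketched in
- C as follows. This is an illustration only, not the 3L configurer's
- code: the names topology, add_link and build_routes are invented, links
- are assumed to be bidirectional, and a plain breadth-first search
- stands in for whatever algorithm the real table generator uses. Links
- marked "raw" are simply skipped when routes are computed.
-
```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 16
#define NO_ROUTE  (-1)

/* Hypothetical topology description. link[a][b] is non-zero when a
 * hardware link joins nodes a and b; raw[a][b] is non-zero when the
 * configuration file has dedicated that link to "raw" traffic. */
typedef struct {
    int n;                                     /* nodes in use */
    unsigned char link[MAX_NODES][MAX_NODES];
    unsigned char raw[MAX_NODES][MAX_NODES];
} topology;

void topology_init(topology *t, int n)
{
    assert(n > 0 && n <= MAX_NODES);
    memset(t, 0, sizeof *t);
    t->n = n;
}

/* Record a bidirectional link, optionally reserved for raw traffic. */
void add_link(topology *t, int a, int b, int is_raw)
{
    assert(a >= 0 && a < t->n && b >= 0 && b < t->n);
    t->link[a][b] = t->link[b][a] = 1;
    t->raw[a][b]  = t->raw[b][a]  = (unsigned char)(is_raw != 0);
}

/* Fill next_hop[src][dst] with the neighbour of src that lies on a
 * shortest path to dst using only non-raw links, or NO_ROUTE if dst
 * cannot be reached that way. */
void build_routes(const topology *t, int next_hop[MAX_NODES][MAX_NODES])
{
    int queue[MAX_NODES];
    int dst, head, tail, u, v;

    for (dst = 0; dst < t->n; dst++) {
        for (v = 0; v < t->n; v++)
            next_hop[v][dst] = NO_ROUTE;
        next_hop[dst][dst] = dst;

        /* Breadth-first search outward from dst: the first time v is
         * reached, it is reached from u, so u is v's next hop. */
        head = tail = 0;
        queue[tail++] = dst;
        while (head < tail) {
            u = queue[head++];
            for (v = 0; v < t->n; v++) {
                if (t->link[u][v] && !t->raw[u][v] &&
                    next_hop[v][dst] == NO_ROUTE) {
                    next_hop[v][dst] = u;
                    queue[tail++] = v;
                }
            }
        }
    }
}
```
-
- For a four-node ring 0-1-2-3-0 in which the link between nodes 0 and 1
- is marked raw, this sketch routes traffic from node 0 to node 1 via
- node 3 rather than over the direct link.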
-
- The third main issue requiring changes to the configurer concerned
- the differences between the .B4-format task image files generated by
- the 3L transputer linker and the COFF-format images produced by the
- TI linker. .B4 files are
- position-independent, so the transputer configurer has little work to do
- in arranging for arbitrary groups of tasks to be loaded onto a single
- node. COFF objects are position-dependent, but contain the relocation
- information required to reposition them. The required on-the-fly
- relocation of COFF task images had to be incorporated into the 'C40
- configurers.
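-
- In outline, relocating an image means adding the difference between
- the load address and the link address to every word that holds an
- absolute address. The C sketch below shows only that step. It is a
- simplification under stated assumptions: task_image and relocate_image
- are invented names, and the relocation list is flattened to a table of
- word offsets, whereas real COFF relocation entries also carry a symbol
- index and a relocation type.
-
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much simplified in-memory view of one task image. */
typedef struct {
    uint32_t        linked_base; /* address the linker assumed          */
    uint32_t       *words;       /* image contents, one 32-bit word each */
    size_t          nwords;
    const uint32_t *reloc;       /* word offsets that hold addresses     */
    size_t          nreloc;
} task_image;

/* Reposition the image so that it can be loaded at load_base.
 * Returns 0 on success, or -1 (leaving the image untouched) if any
 * relocation entry points outside the image. */
int relocate_image(task_image *img, uint32_t load_base)
{
    uint32_t delta = load_base - img->linked_base; /* wraps mod 2^32 */
    size_t i;

    assert(img != NULL);

    /* Validate every entry first so a bad table changes nothing. */
    for (i = 0; i < img->nreloc; i++)
        if (img->reloc[i] >= img->nwords)
            return -1;

    for (i = 0; i < img->nreloc; i++)
        img->words[img->reloc[i]] += delta;

    img->linked_base = load_base;
    return 0;
}
```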
-
- 9.6 The Host Interface
-
- Parallel C applications for the 'C40 expect disk I/O and similar services to
- be provided by a server program running on a host computer. The
- Parallel C run-time library communicates with the server by some message
- protocol over one of the links of the "root" processor in the network.
-
- In the transputer world, despite the diversity of available host
- hardware, from PCs and Amigas to VAXen and IBM 3090 mainframes, the
- physical interface seen from the transputer end of these host
- communications was always the same: the root transputer would simply
- send and receive server protocol messages over the link from which it
- had itself been bootstrapped. This meant that the same transputer
- code would work with
- any host system supporting the server protocol. A superfluity of rather
- ill-thought-out, ad-hoc server protocols confused the picture, but that
- is a separate issue.
-
- Even the host of the party had a fairly easy time, especially if it was
- running on an IBM-compatible PC, since the early availability and wide
- take-up of Inmos' original B004 development system had led to that
- board's ISA-bus hardware interface design becoming a de facto standard
- for other PC board vendors. This led to the happy situation where it
- was possible to take PC-oriented transputer software and run it "out of
- the box" on boards from a wide variety of different manufacturers. The
- B004 design's limited host I/O bandwidth of the order of 100KB/s eventually
- proved overly restrictive, leading to a proliferation of incompatible
- "go-faster" interfaces. Nevertheless, the basic B004 interface was
- usually retained as a lowest common denominator for software
- portability. The later B008 interface also achieved wide currency due
- to Inmos' decision to be a major player in the market for board-level
- transputer systems.
-
- The evolution of 'C40 board interfaces already looks set to proceed in a
- different direction, largely because no "definitive" hardware design has
- become available early enough to set a trend. TI has successfully
- sponsored the "TIM-40" standard [6] for plug-in 'C40 "daughterboard" modules
- corresponding in scope to Inmos' TRAM standard, but this clearly could
- not set a standard host/motherboard interface for the wide variety of
- host buses which must be supported. In practice all kinds of
- host interfaces to TIM-40 motherboards are appearing, ranging from
- bit-serial JTAG booting through message communication over one (or two)
- of the links on the root 'C40, right up to shared dual-ported
- RAM bypassing the links completely. A number of non-TIM standard boards
- are also becoming available.
-
- 3L's obvious goal was to support all these diverse interfaces while
- minimising the software effort involved. The software effort is not
- confined to the server and so could not easily be left to OEMs and
- third-party resellers to handle. Clearly, the root 'C40 must "know" how
- to communicate with the host: should messages be sent over a link (which
- one), or written into a shared memory buffer (where, and in what
- format)?
-
- This knowledge could either be "built-in" to a multitude of variant
- hardware-specific versions of 'C40 Parallel C, or we could take a leaf
- out of the PC software industry's book (where diversity of graphics
- "standards" presents a similar problem) and use hardware-specific software
- drivers for different boards. These drivers can in principle be
- created independently of 3L by hardware manufacturers, and would allow
- Parallel C to offer timely support for new hardware developments.
-
- A software driver approach has in fact been adopted: all communication
- from the root to the host is controlled by the driver, which acts as a
- plug-in extension to the kernel. Note that for users to be able to run
- their Parallel C programs "out of the box" on different hardware, a
- particular driver cannot be bound into the application image file.
- Instead, the required driver is loaded at run time from the
- host-dependent server into the root 'C40, along with the kernel and user
- code.
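-
- One way to picture such a plug-in driver is as a table of entry points
- which the kernel calls without knowing what hardware lies behind them.
- The C sketch below is an assumption-laden illustration, not the
- Parallel C driver interface: host_driver, kernel_install_driver,
- host_send and host_recv are invented names, and the "shared memory"
- driver is a loopback mailbox standing in for real board hardware.
-
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver interface: the kernel sees only these entry
 * points, never the board-specific mechanism behind them. */
typedef struct host_driver {
    const char *name;
    int (*send)(void *state, const void *msg, size_t len);
    int (*recv)(void *state, void *buf, size_t cap);
    void *state;                 /* driver-private data */
} host_driver;

static const host_driver *active_driver; /* set once the driver is loaded */

void kernel_install_driver(const host_driver *d)
{
    active_driver = d;
}

/* All root-to-host traffic goes through the installed driver. */
int host_send(const void *msg, size_t len)
{
    if (active_driver == NULL)
        return -1;
    return active_driver->send(active_driver->state, msg, len);
}

int host_recv(void *buf, size_t cap)
{
    if (active_driver == NULL)
        return -1;
    return active_driver->recv(active_driver->state, buf, cap);
}

/* Example driver: a one-message mailbox imitating a shared-memory
 * buffer. A link-based driver would supply different functions but
 * the same host_driver table. */
typedef struct {
    unsigned char data[64];
    size_t        len;
} shm_mailbox;

static int shm_send(void *state, const void *msg, size_t len)
{
    shm_mailbox *m = (shm_mailbox *)state;
    if (len > sizeof m->data)
        return -1;
    memcpy(m->data, msg, len);
    m->len = len;
    return (int)len;
}

static int shm_recv(void *state, void *buf, size_t cap)
{
    shm_mailbox *m = (shm_mailbox *)state;
    size_t n = m->len < cap ? m->len : cap;
    memcpy(buf, m->data, n);
    m->len = 0;
    return (int)n;
}
```
-
- Because the application calls only host_send and host_recv, the same
- application image can run unchanged with whichever driver is installed
- at run time.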
-
- 10. Meeting the Management Challenge
-
- Having seen the technical problems which had to be addressed in porting
- Parallel C to the 'C40, we shall round off this paper by looking briefly
- at how the software engineering management problem of successfully tackling a
- project of this scope with limited resources and a fixed deadline was
- addressed.
-
- The human resources available consisted of up to four full-time
- developers plus two test/QA staff over a planned project timescale of
- six months from the start of active work to shipment of initial
- software. The deadline had been self-imposed by the company's decision
- to announce publicly from the time the port was agreed with Texas
- Instruments in August 1991 that the product was going to be launched
- with due fanfare at CeBIT in March 1992 (part of the Hannover Fair).
-
- In fact this deadline, by giving everyone involved in the project a
- "red line" to work towards, proved to be a blessing, concentrating
- minds and effort.
-
- Although very little contingency "slack" was available in the plan, an
- early start on those parts of the project which were independent of the
- hardware, notably porting of the configurers from the transputer to the
- 80x86 and learning about the internals of VCR/UPR, allowed us to absorb
- what turned out to be a considerable slippage in the planned
- availability of the board-level hardware. Making as much use as
- possible of the instruction-level 'C40 simulators which were available
- from TI also helped.
-
- With limited forces, the disposition of the troops becomes of paramount
- importance. Early agreement on the main goals of the project was the
- greatest help in clear decision-making and efficient execution: the
- product was to be robust from day one; it was to be ready by March 1992.
- These goals and limited resources meant putting into practice the
- prevailing gospel of software reuse. We simply could not afford the
- reinvention of any wheels, large or small.
-
- This constraint led directly to the decisions to incorporate TI's
- existing compiler and linker into the package, and to adopt a proven
- message passing mechanism (VCR/UPR) developed in academia against the
- strong and understandable temptation to "roll our own". Both of these
- decisions gave our team considerable extra "leverage" by building on
- the work of large groups elsewhere who had already devoted many
- man-years of effort to these topics.
-
- 3L is particularly proud to have been involved in the commercialisation
- of some of the products of EC-sponsored academic research (the VCR/UPR
- package developed as part of the PUMA programme of research into
- parallel architectures). A rare feat for a small UK company!
-
- 11. Remaining Problems
-
- Most Parallel C code for the transputer which does not
- perform unnecessary low-level manipulation of the hardware (fiddling
- with process queues or writing directly to hard link channel addresses)
- can run happily on the 'C40.
-
- The main stumbling blocks are inherent in the 'C40 architecture: some
- programs will be unable to cope with an environment where bytes are
- word-sized, and where floating point double precision is no wider than
- single precision. Most applications in the DSP and real-time embedded
- system area targeted by Texas Instruments for the 'C40 should not find
- this too much of a problem.
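-
- Programs that assume 8-bit characters can often be adapted by keeping
- octets packed explicitly in words rather than relying on the size of
- char. The C sketch below illustrates the technique. The function names
- put_octet and get_octet are invented, and the little-endian ordering
- of octets within a word is an arbitrary choice for the example. It
- uses only the low 32 bits of each unsigned long, so it behaves the
- same whether a char is 8 bits or a full word.
-
```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define OCTETS_PER_WORD 4   /* four 8-bit octets in each 32-bit word */

/* Store the low 8 bits of value as octet number i of a packed array. */
void put_octet(unsigned long *words, size_t i, unsigned value)
{
    unsigned shift = (unsigned)(i % OCTETS_PER_WORD) * 8u;
    unsigned long mask = 0xFFUL << shift;

    assert(words != NULL);
    words[i / OCTETS_PER_WORD] =
        (words[i / OCTETS_PER_WORD] & ~mask) |
        ((unsigned long)(value & 0xFFu) << shift);
}

/* Fetch octet number i from a packed array. */
unsigned get_octet(const unsigned long *words, size_t i)
{
    unsigned shift = (unsigned)(i % OCTETS_PER_WORD) * 8u;

    assert(words != NULL);
    return (unsigned)((words[i / OCTETS_PER_WORD] >> shift) & 0xFFUL);
}

/* Words needed to hold n octets when packed four to a word. An
 * unpacked char array on a machine with word-sized chars would
 * need n words instead. */
size_t packed_words(size_t n)
{
    return (n + OCTETS_PER_WORD - 1) / OCTETS_PER_WORD;
}
```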
-
- 12. Conclusion
-
- The 'C40 is well-suited to running DSP, real-time and desktop
- supercomputer applications written in 3L Parallel C an order of
- magnitude faster than the previous transputer implementation, although
- some applications which require intensive character manipulation or
- double-precision floating point may be unsuited to the 'C40
- architecture.
-
- Programs written for the 'C40 in Parallel C retain the advantages of
- clarity and reliability which Parallel C inherits from its
- CSP/transputer roots. Users who have invested in learning
- occam/CSP-style techniques of parallel programming can immediately apply
- their experience to this new generation of hardware. In addition, users
- of previous generations of TI DSPs gain access to this powerful and
- proven body of technique for reliable parallel systems development.
-
- In undertaking this porting exercise, a real commitment to software
- reuse has helped us to meet our demanding timescale.
-
- The demonstration of 3L's commitment to migration of Parallel C and its
- popular Application Program Interface to new state-of-the-art parallel
- processors also frees users from undue dependence on a single source for
- parallel hardware, which is bound to have a positive effect on the
- availability of parallel applications software.
-
-
- References
-
- [1] TMS320C4x User's Guide. Texas Instruments. 2564090-9721 revision A.
- May 1991.
-
- [2] A.D. Culloch, Parallel Programming Toolkit for 3L-C, Fortran and
- Pascal. In: Jon Kerridge (Ed.), Proceedings of the 8th Technical
- Meeting of the Occam User Group. March 1988.
-
- [3] Parallel C User Guide. 3L Ltd., 1991.
-
- [4] P.S. Robertson, Intermediate Codes for Optimising Compilers. Ph.D.
- Thesis. Edinburgh University, 1979.
-
- [5] M. Debbage, M.B. Hill and D.A. Nicole. Virtual Channel Router
- version 2.0 user guide. Technical Report, University of
- Southampton, June 1991.
-
- [6] "TIM-40" TMS320C4x Module Specification. Draft 0.105, January 1992.
- Texas Instruments, Manton Lane, Bedford MK41 7PA, England.
-
-
- - --
- George Vokalek, Dept. of Mechanical Engineering,
- University of Adelaide, GPO BOX 498,
- gvokalek@augean.eleceng.ua.oz.au Adelaide 5001, South Australia
- Phone 61-8-228-4704, Fax 61-8-224-0464
-
- ------- Received message ends ----
-