NetNews Usenet Archive 1992 #16

Internet Message Format | 1992-07-20 | 41.0 KB

Path: sparky!uunet!munnari.oz.au!yoyo.aarnet.edu.au!sirius.ucs.adelaide.edu.au!augean.eleceng.adelaide.edu.AU!gvokalek
From: gvokalek@augean.eleceng.adelaide.edu.AU (George Vokalek)
Newsgroups: comp.dsp
Subject: Re: Looking for users of TI TMS320C40 DSP chip and/or 3L 'C' compiler
Message-ID: <1992Jul21.064853.29476@augean.eleceng.adelaide.edu.AU>
Date: 21 Jul 92 06:48:53 GMT
References: <5362@mccuts.uts.mcc.ac.uk>
Organization: Electrical & Electronic Eng., The University of Adelaide
Lines: 853

It really pisses me off when I type in a reply to a post and the email bounces! Therefore I will share my annoyance with the world by posting my reply:

----- Transcript of session follows -----
While talking to nsfnet-relay.ac.uk:
>>> DATA
<<< 554 No date field given
554 ehgabp2@uts.mcc.ac.uk... Service unavailable

------- Received message follows ----
Received: by augean (5.61+IDA+MU/4.8Z) id AA29036; Tue, 21 Jul 1992 16:11:19 +0930
Date: Tue, 21 Jul 1992 16:11:19 +0930
From: gvokalek (George Vokalek)
Message-Id: <9207210641.AA29036@augean.eleceng.adelaide.edu.au>
To: ehgabp2@uts.mcc.ac.uk
Subject: Re: Looking for users of TI TMS320C40 DSP chip and/or 3L 'C' compiler
Newsgroups: comp.dsp
References: <5362@mccuts.uts.mcc.ac.uk>

In comp.dsp you write:

>We are looking for users of the Texas Instruments TMS320C40 DSP chip
>and the 3L 'C' compiler, which produced object code suitable for this
>chip, hosted on a PC AT.

>Has anyone done any comparisons with the Inmos T800 using a 3L compiler?

While I cannot answer your question directly, I can say that the C40 should run rings around the T800, especially in floating point. The C40 is more comparable in performance to the mythical T9000. Our transputer people have just opted for the Inmos C compiler rather than the 3L, since we do more embedded type work (rather than big hyper-parallel-type-work), and the 3L is 2-3 times the cost of the Inmos.
I have also avoided the 3L compiler for the C40 (we are waiting for our C40 hardware to arrive) for similar reasons. I do have an electronic copy of a paper by 3L discussing its porting of the compiler to the C40. That paper is appended at the end of this message. (Thanks to S.J.Bradshaw at Traquair Inc. for providing it to me).

Regards,
George.

******************************************

Porting the 3L Parallel C Environment to the Texas Instruments TMS320C40

Alan D. Culloch, Director
3L Ltd., Peel House, Ladywell, Livingston EH54 6AG, Scotland

Abstract. The TMS320C40 ('C40) is a transputer-like parallel processor from Texas Instruments. It is an order of magnitude faster than the T800 transputer. Parallel C is a popular programming environment for the transputer. The properties of both the 'C40 and Parallel C are described and the significant differences between the 'C40 and the transputer are pointed out. The techniques used to overcome these obstacles to porting Parallel C are presented. These include building a new real-time kernel and reusing existing software packages from industry and academia. The suitability of the 'C40 for parallel applications is discussed.

1. The TMS320C40 Chip

The TMS320C40 ('C40) is the latest member of Texas Instruments' TMS320 family of Digital Signal Processing (DSP) chips. It is the first processor from a major US semiconductor vendor to feature transputer-like built-in communications links for parallel processing. In addition, it gives state-of-the-art single processor performance equalling or exceeding existing ``superchips'' such as the i860. The device is fully described in [1]. The performance of the 'C40 is certainly impressive: 275 million operations per second and 50 MFLOPS. Six 160Mbit/s communications links give each node in a network a total communications throughput of 120Mbyte/s.

The parallel processing features of the 'C40 were added in response to feedback from customers.
They indicated that large parallel DSP and real-time systems were already being built using the previous generation of standalone chips (the 'C30). What was wanted was a way to simplify the design and implementation of such systems. This was achieved by adding fast communications links to the processor. These make it easy to construct large MIMD parallel systems with no external communications support logic.

2. Availability

Sample 'C40 silicon has been available since the formal launch of the chip in June 1991. Board-level hardware became available from European vendors such as Hema (Germany) and Hunt Engineering (UK) by the end of 1991. We now look in turn at each of the features of the 'C40 design which are most significant for parallel systems development.

3. Memory Architecture

'C40 memory is always organised in 32-bit words. Byte addressing is not supported. I.e., the first word in memory is at address 0, the next word is address 1, and so on. Integer, 32-bit floating point and character data items all occupy one word each (although it is possible to pack characters into integers "by hand" by shifting and masking.) In C terms, the sizeof all atomic types is 1.

3.0 Data Formats

Floating-point data are held in a proprietary 32-bit format. Instructions are provided to convert to and from the IEEE standard 32-bit format. Intermediate results can be held in the processor's internal registers in an extended-precision 40-bit format.

3.1 Memory Buses

The 'C40 has two external memory buses, called the local bus and the global bus. The first 2Gword half of the 4Gword (16Gbyte) 32-bit address space accesses the local bus; the other half accesses the global bus. Having two external memory interfaces provides greater total CPU/memory bandwidth and increases the available parallelism. The first 3Mword of the local bus region is reserved for peripherals (including internal peripherals like the on-chip timers and DMA engines) and two blocks of internal RAM.
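Section 3 remarks that characters can be packed into integers "by hand" by shifting and masking. A minimal sketch of that technique in portable C follows. The helper names (pack_char, unpack_char) and the choice of lane 0 as the least significant eight bits are assumptions made for illustration; they are not taken from the paper or from any TI or 3L library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: four 8-bit characters packed into one 32-bit word.
 * Lane 0 is assumed to be the least significant eight bits. */
#define CHARS_PER_WORD 4u

/* Return w with the character in the given lane (0..3) replaced by c. */
static uint32_t pack_char(uint32_t w, unsigned lane, unsigned char c)
{
    assert(lane < CHARS_PER_WORD);
    unsigned shift = lane * 8u;
    w &= ~((uint32_t)0xFFu << shift);       /* clear the old character    */
    w |= (uint32_t)(c & 0xFFu) << shift;    /* mask: a char may be wider  */
    return w;
}

/* Extract the character held in the given lane (0..3) of w. */
static unsigned char unpack_char(uint32_t w, unsigned lane)
{
    assert(lane < CHARS_PER_WORD);
    return (unsigned char)((w >> (lane * 8u)) & 0xFFu);
}
```

The explicit `& 0xFFu` matters on a machine where a char occupies a whole word, since nothing else would stop the upper bits of the character from spilling into neighbouring lanes.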
Depending on the setting of an external pin, the first 1Mword, from 0-1M, is mapped either to the local bus or an on-chip ROM containing built-in bootstrap code.

3.2 Internal RAM

The 'C40 has two blocks of on-chip RAM. Each block is 1Kword (4Kbytes) long. The on-chip RAM can be used for code or data: it appears at the end of the 3Mword of reserved space below the local bus region, and is contiguous with the start of local bus memory.

3.3 Parallel Memory Access

Each different block of memory (internal ROM and RAM blocks 0 and 1, the external memory buses) can support two concurrent accesses, allowing a considerable amount of micro-parallelism in the execution of sequential code. For example: the CPU can access two data values in one RAM block and perform an external program fetch in parallel with one of the DMA engines loading another RAM block, all within a single cycle.

3.4 Cache Memory

To minimise fetching of program code from relatively slow external memory the 'C40 contains 512 bytes of on-chip cache for instructions. This is in addition to, and operates independently of, the two blocks of general-purpose on-chip RAM.

4. CPU Architecture

4.1 Register File

The CPU has 32 registers. Twelve of these are general-purpose registers which can hold either word data or floating-point data in the processor's extended-precision format (40 bits wide.) Another eight general registers can hold word data only. The remaining registers have dedicated uses, as stack pointers, index registers, loop counters and so on.

4.2 Instruction Set and Pipelining

'C40 instructions have a three-address format specifying two source operands and a destination location. Various operand addressing modes are provided (register, immediate, direct, indirect.)
However, addressing is not completely general: instructions generally restrict the allowed addressing modes for their operands and results in various ways, for example, by requiring that the result of an operation be stored into one of the twelve extended-precision registers. All the usual operations on arithmetic and floating-point data are provided.

One interesting group of instructions provides for loop control without the overhead of embedding compare and branch instructions in the instruction stream being executed. The processor contains some special-purpose registers for loop counting and holding pointers to the start and end of a block of code to be executed repeatedly. In its special "repeat mode" the processor implicitly moves the PC back to the start of the loop when it steps over the last instruction of the loop block. Inner loops can therefore dispense with the overhead of counting, comparing and branching at the macroinstruction level.

As is now the norm for RISC designs, delayed branches allow overlapping of useful instructions with the fetching of out-of-sequence instructions. The 'C40 in fact goes a lot further down this path to increased performance, by the provision of quite a large number of "parallel" instructions. These allow particular pairs of operations to be overlapped in a single instruction cycle, for example multiplication of two floating-point values can be performed in parallel with storing another floating-point value in memory. The source and destination locations for both instructions must be encoded into a single instruction word, and are therefore more restricted in the addressing modes allowed than other types of instruction.

5. Inter-Processor Communication

5.1 Communications Ports

The 'C40 has six built-in communications ports. Each can handle bi-directional communication at 20Mbyte/s, giving a total maximum throughput of 120Mbyte/s.
Ports are connected to each other by a 12-wire parallel interface (8 data bits plus 4 control lines.) Although the links are bi-directional, they are not full duplex. Data may only flow across the link in one direction at a time. However, "turnaround" of the direction of the traffic is automatic when the link is accessed. Ownership of the link is managed by a hardware token-passing scheme which uses some of the four control lines.

On reset, three of the links go into output mode; the other three are forced to input mode. A board designer must ensure that in the reset state outputs are connected only to inputs, and vice versa. Failure to do so can destroy the processor. Thankfully this problem is not visible to software.

'C40 links are buffered. Both the input and the output side of each link have an 8-word hardware FIFO queue attached. Therefore there are effectively 16 words of buffering on a full communication. Writing to a full FIFO, or reading from an empty one, blocks the CPU until the condition is cleared.

Reading and writing the links is straightforward. The communication ports appear as mapped devices in the 'C40 address space. Sending a word over an output link is simply a matter of writing a word to the address of the mapped device.

5.1 The DMA Coprocessor

The 'C40 provides an on-chip DMA engine to relieve the CPU of the need to intervene in inter-processor data transfer. The DMA engine has its own separate internal buses connecting it to memory and the communications ports. It can therefore transfer data without interfering with the operation of the CPU.

The DMA engine is itself a quite sophisticated programmable device. It is driven by control blocks in memory which can be linked together to form DMA "programs". Each block specifies the source, destination and size of an individual transfer. Indexing allows non-contiguous data to be transferred (e.g., every 10th word from an array.) Bit-reversed addressing is provided for FFT applications.
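The idea of linked control blocks with indexing can be pictured with a small software model. The sketch below is only an illustration written for a host machine: the structure layout, the field names and the function dma_run are assumptions, not the 'C40 register or control-block format, and bit-reversed addressing is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block; the fields are illustrative only. */
struct dma_block {
    const uint32_t *src;           /* first source word                     */
    uint32_t *dst;                 /* first destination word                */
    ptrdiff_t src_index;           /* words added to src after each word    */
    ptrdiff_t dst_index;           /* words added to dst after each word    */
    size_t count;                  /* number of words in this transfer      */
    const struct dma_block *next;  /* next block of the "program", or NULL  */
};

/* Walk a chain of control blocks, copying one word at a time.
 * Returns the total number of words moved. */
static size_t dma_run(const struct dma_block *blk)
{
    size_t moved = 0;
    for (; blk != NULL; blk = blk->next) {
        const uint32_t *s = blk->src;
        uint32_t *d = blk->dst;
        assert(s != NULL && d != NULL);
        for (size_t i = 0; i < blk->count; i++) {
            *d = *s;
            moved++;
            if (i + 1 < blk->count) {  /* do not step past the last word */
                s += blk->src_index;
                d += blk->dst_index;
            }
        }
    }
    return moved;
}
```

With src_index set to 10 and dst_index set to 1, a single block gathers every 10th word of an array into a contiguous buffer, which is the example given in the text.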
The basic operation of the DMA engine is fully general: it can move words from any part of memory to any other. After each word is transferred, the next is addressed by adding an index value from the DMA control block to the previous address. This, combined with the fact that the communications ports are mapped as memory locations, allows the DMA engine to transfer a block of words to or from a fixed location (such as a communications port) by specifying a zero "index" value. A further benefit of this approach is that data can be routed directly through a 'C40 by the DMA engine from one communication port to another, without any CPU intervention at all.

5.2 Timers

The 'C40 has two on-chip timer/event counter units which can be used to generate interrupts and support timeouts on link communications.

6. 3L Parallel C

Parallel C was originally developed by 3L for the transputer. To allow porting of applications, a 'C40 implementation of Parallel C would have to support the same user interface as the transputer implementation. The main functions provided by the transputer version of Parallel C which would have to be reproduced in a 'C40 port are outlined here. Parallel C is more fully described in [2] and [3].

In addition to a conventional C compiler, linker and run-time library for the target processor, Parallel C provides the following.

6.1 Extended Run-Time Library

The C run-time library is extended with data types representing transputer channels and with functions to: create new threads of execution (processes); pass messages between tasks in a synchronised CSP-style fashion, with optional timeout handling; wait/signal on semaphores; and perform a CSP-style "alternative wait" operation. This last blocks the calling task until one of a number of input channels becomes ready to communicate. In the transputer implementation the message-passing functions can be implemented by in-line code accessing built-in hardware facilities.
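The "alternative wait" can be pictured as a selection over a set of input channels. The sketch below is a single-threaded model only: every name in it (demo_chan, demo_alt_poll and so on) is hypothetical and is not the Parallel C library interface, and where the real operation would block the calling task, this version simply reports that nothing is ready.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-word channel: ready is non-zero while a word is pending. */
struct demo_chan {
    int ready;
    int word;
};

/* Sender side of the model: deposit one word on an idle channel. */
static void demo_chan_offer(struct demo_chan *c, int w)
{
    assert(!c->ready);
    c->word = w;
    c->ready = 1;
}

/* Receiver side of the model: take the pending word and free the channel. */
static int demo_chan_take(struct demo_chan *c)
{
    assert(c->ready);
    c->ready = 0;
    return c->word;
}

/* Scan the set in order; return the index of the first channel with a word
 * pending, or -1 if none is ready (a real ALT would block here instead). */
static int demo_alt_poll(struct demo_chan *const chans[], size_t n)
{
    assert(chans != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (chans[i]->ready)
            return (int)i;
    }
    return -1;
}
```

A receiving task would call demo_alt_poll on its set of inputs and then demo_chan_take on whichever channel was selected. Scanning in index order gives earlier channels priority, which is a design choice of this sketch rather than anything stated in the paper.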
6.2 Configurer

In Parallel C, complete systems are built by composing independent software tasks. Each main task which is to execute in parallel is an independently compiled and linked C program. The tool which binds these linked task images together with various bits of bootstrap and loader software to produce a bootable multi-tasking application image file is called the configurer. It is driven by a user-written text file which describes the required hardware configuration, specifies the files containing the software tasks, and indicates which tasks are to be placed onto which processors. Replicated tasks placed on many different processors are held once in the application image file and copied from processor to processor dynamically at load time.

6.3 Flood Configurer

This variant configurer takes two special kinds of task, the master and worker tasks of a processor farm application, and combines them with standard bootstrap, loader and message-routing software to make a bootable application image which, when loaded, spreads out and "floods" every accessible processor in the network with a copy of the same worker task. Work packets are distributed from the master to the workers and result packets are carried back through the network by the routing software. Packet broadcasts from the master to all workers are supported for initial setup of data to be worked on in parallel.

7. Rationale for Porting Parallel C

From the description of the 'C40 processor architecture given previously it should be clear that it fits the original sense of the term "transputer", in that it is a programmable device with its own integral memory and inter-processor communications links. It is therefore sensible to consider porting parallel software between these platforms. The interesting differences between Inmos' and Texas Instruments' implementations of this concept are dealt with in a later section.
Here, we just note that the similarities are sufficient to make it appear useful and sensible to make existing transputer software available on this new platform, and that this includes development systems such as Parallel C. This would be so even if it were not the case that the 'C40 is an order of magnitude faster than the currently-available T800 transputers, in both raw processing speed and in communications bandwidth. However, because there is such a large performance benefit, we expect the take-up of this new technology by current transputer users to be quicker than it would otherwise be.

The other main reason for undertaking a port was the popularity of Texas Instruments' previous 'C30 product in DSP work. The familiarity of DSP engineers with this product family should lead to fast take-up of the parallel processing 'C40 and a consequent need for parallel systems development software such as Parallel C.

8. Technical Issues Involved in a Port

The main technical issues of interest in the porting process are to do with the architectural differences between the transputer and the 'C40.

8.1 Communications Links

Transputer links differ from 'C40 ones in both the hardware and its appearance to the software. The 12-wire links of the 'C40 give greater speed (20Mbyte/s vs. 20Mbit/s) but require more pins and board space. These links appear as memory locations to be read or written a word at a time under program control (or under control of the DMA coprocessor.) There are six of them rather than four, so a wider range of fixed link topologies can be built directly, including 3D grids and larger hypercubes.

More importantly, the fact that the 'C40 links are hardware buffered means that they cannot be used directly to emulate transputer-style unbuffered, synchronised message passing. Consider a short message of two words sent from processor A to processor B.
If the receiving process on B is not ready to receive the message, then in the transputer case the sending process will be blocked until the receiver becomes ready. If 'C40 links are used "raw", then either the message will be sent and A will proceed asynchronously, or the FIFOs will be full up, and the whole of processor A (not just the sending process) will be blocked. This problem implies that a 'C40 implementation of Parallel C will require some low-level software to control the hardware links. All transputer-style synchronised message traffic on a link must be under the control of this software. Note that some or all links may be used "raw" to avoid the software overhead involved here, but that communications over those links would not have transputer-like synchronisation properties and would have to be carefully managed under user control. Ensuring synchronisation of sender and receiver requires the link control software at both ends of a communication path to exchange acknowledgements of receipt using some protocol over and above the user data being transmitted. The question of what protocol is used is addressed later. Although 'C40 links are hardware FIFO-buffered, when a buffer fills up, the whole processor (not just the sending or receiving process) is blocked. User code must either poll "ready" flags to avoid the situation, or make use of interrupt signals which indicate when the buffers become ready. These interrupts can be used in combination with the DMA engine. To avoid one thread blocking the whole processor, the handling of the links and these interrupts must be centralised in the low-level support software. Software is also required to handle the discrepancy between the number of possible concurrently active communications on the links (12, one in each direction on six links) and the six available DMA channels. This requires dynamic software management of free DMA channels. 
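The mismatch between twelve possible concurrent link transfers and six DMA channels amounts to a small resource-allocation problem. One way to picture the "dynamic software management of free DMA channels" is a bitmap pool, sketched below. This is an illustration only: the type and function names are invented here, and the paper does not say how the 3L kernel actually tracks free channels.

```c
#include <assert.h>

#define DMA_CHANNELS 6   /* from the text: six DMA channels               */
#define LINK_ENDS   12   /* six links, one transfer in each direction     */

/* Hypothetical pool: bit i of busy is set while channel i is in use. */
struct dma_pool {
    unsigned busy;
};

/* Claim the lowest-numbered free channel.
 * Returns 0..5, or -1 when all are busy and the request must be queued. */
static int dma_acquire(struct dma_pool *p)
{
    for (int ch = 0; ch < DMA_CHANNELS; ch++) {
        if (!(p->busy & (1u << ch))) {
            p->busy |= 1u << ch;
            return ch;
        }
    }
    return -1;
}

/* Return a channel to the pool once its transfer has completed. */
static void dma_release(struct dma_pool *p, int ch)
{
    assert(ch >= 0 && ch < DMA_CHANNELS);
    assert(p->busy & (1u << ch));   /* releasing a free channel is a bug */
    p->busy &= ~(1u << ch);
}
```

If all LINK_ENDS directions ask for a channel at once, only six requests succeed; the remaining six get -1 and have to wait until a completion interrupt leads to a dma_release.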
8.2 Processes

Unlike the transputer, the 'C40 has no built-in firmware to manage process creation, timeslicing or message synchronisation. These functions must be performed by software: another job for the low-level software. Note that the large number of CPU registers in the 'C40 puts a premium on the run-time efficiency of such a software context-switching mechanism. There is some built-in software on the 'C40 held in ROM: this is a bootstrap program which allows networks of 'C40s to be booted from their links.

8.3 CPU Architecture

A number of software changes are obviously implied by the differences in internal architecture between the two processors. The register vs. stack-based assembly programming model, the very different instruction set, the non-IEEE floating-point format, and the lack of byte addressability obviously all conspire to require a completely different compiler.

9. Overcoming the Differences

In deciding how to go about the port and overcome the technical difficulties mentioned above, three goals were paramount:

1. A robust product was required.
2. It should be implemented as quickly as possible.
3. As far as possible the resulting product should retain the application programmer interface of the transputer version.

This had two main implications for our design:

1. Reuse as much existing, tested software as possible.
2. A number of transputer features would need to be emulated in software.

In the rest of this section, we look at the different software components of Parallel C, and how they have changed from the transputer version of the software.

9.1 The Compiler and Linker

Two options were considered for the compiler: first, use TI's own; second, re-jig the existing 3L transputer C compiler by replacing its machine-dependent code generator module with a 'C40 version. This was a relatively easy choice given the previously stated goals.
Even although the code generator in the existing compiler is quite well separated from the rest of the program, using the intermediate-code technology described in [4], and the team had considerable experience with such work, it is still quite large (15-20K source lines.) Writing a new code generator would consume a disproportionate fraction of the resources available to the project.

Since TI's compiler was already in an advanced state of development at the beginning of the project, and was an evolution from the existing code-compatible 'C30 compiler, we decided to use it as the basis of Parallel C. This met our goal of minimising development time, and by gaining access to well-tried code should prove to be robust as well. Subsidiary benefits of this route which also weighed in the decision were the resulting automatic compatibility with TI's assemblers, libraries, and any third-party libraries written for that compiler.

9.2 The C Run-Time Library and Host Server

Because it was originally designed with standalone DSP applications in mind rather than more general parallel processing work, TI's C compiler is a "freestanding" implementation, in the sense of the ANSI standard. I.e., its library does not provide standard I/O and other similar high-level services. 3L has therefore had to port those parts of its existing C run-time library to the 'C40, in order to provide the full run-time environment familiar to users of Parallel C.

As on the transputer, the standard I/O services in the run-time library are provided by means of remote procedure calls from library code on the root node of the 'C40 network across a link to a host computer. The protocol used for communication with the host is a variant of the existing afserver protocol, again more for speed of implementation of both library and host server than for elegance (the protocol has a number of well-known drawbacks.)
In addition to providing a full sequential C run-time library, functions also had to be added to support Parallel C's specialised thread creation, message passing, semaphore, and timer calls. On the 'C40 most of these operations are implemented by extracodes which trap to the software kernel described in the next section, rather than being performed in-line as on the transputer.

9.3 The Kernel

As described previously, the lack of built-in support on the 'C40 for processes and timeslicing, the partially asynchronous nature of its links requiring extra protocol to ensure synchronisation, and the dynamic management of DMA channels, all require the presence of a small software kernel on every 'C40 network node running Parallel C. This is the main difference from the transputer implementation, which relies entirely on the processor's firmware kernel, except for network loading software and the packet routers in a flood-configured application.

Implementing a kernel to manage prioritised process queues, device interrupts and timers is classic computer science textbook material, but there are a number of complicating factors. The first is the necessity to minimise the software overhead involved in context switching between processes. Second, memory management must be able to cope with essentially any number of threads being created at run time by the client tasks.

Given that the kernel was to be written from scratch, the choice of coding language was open. In the end, both because of the requirement to minimise context switch times and because a lot of the code in the kernel was in any case to do with intimate manipulation of the internal state of the processor, assembly language was chosen.

The kernel provides services to client user tasks at the level of "send this message to that port", "wait for timer", "create new process". Note that the message operations provided by the kernel are close to the hardware.
It is responsible for allocating and setting up DMA channels, initiating transfers, and notifying clients of completion. It keeps track of all device interrupts. However, in our design the kernel in the end was not made responsible for message synchronisation. This job is done instead by a higher-level software package; this software is described in the next section, along with our reasons for separating it from the kernel proper.

9.4 Message Synchronisation: VCR and UPR

For some time before the idea of a port of Parallel C to the 'C40 had appeared, we had been surveying various of the general-purpose message routing packages which had become available for the transputer, partly as a result of 3L's involvement in the SERC/DTI Transputer Initiative's working groups on parallel systems standardisation. The packages under review included commercial products such as Express (Parasoft) and FNODE (Sang), academic work like TINY (Edinburgh University) and VCR/UPR (Southampton University), and numerous bits of ad-hoc software written by individual users of Parallel C. In-house ideas were also being explored.

By the time design of the 'C40 port was under way, VCR/UPR [5], developed at Southampton University under the auspices of the European Community's PUMA research programme, had become our "favourite" package, and detailed in-house evaluation of it was proceeding. When the question of how we should handle sender/receiver synchronisation and support for transputer-style ALTs over the basically asynchronous 'C40 link hardware arose, a ready-made solution was therefore at hand.

VCR (Virtual Channel Router) provides exactly the facilities we required (synchronised message send and receive, and ALT with transputer semantics too) and is layered on top of UPR (Universal Packet Router), a low-level packet switcher which makes no assumption that the hardware link layer it runs on has transputer semantics.
In the light of our aim of combining speedy development with a robust end product, the course was clear: we should incorporate VCR/UPR into Parallel C for the 'C40 as the "link-control software" discussed previously. The job of writing a non-deadlocking link protocol handler would be simplified to adapting VCR/UPR to our software environment. Using VCR/UPR (written in C) had the additional advantage of simplifying the job of our (assembler) kernel, making it easier to construct and more robust. All the kernel had to do now was provide simplified system services which could be called from UPR to access the raw link hardware. Relatively few modifications have had to be made to VCR/UPR itself to adapt the package to the 3L software environment. The primary ones have been to provide emulations of the transputer process-control libraries used by the original code, and to allow for the possibility that some of the available links on a node might be dedicated by the configurer to "raw" (non-virtual) traffic, not under the control of VCR/UPR. VCR/UPR is tightly coded and adds minimal overhead to message transit times. Nevertheless, easy direct access to the underlying hardware without unnecessary overhead has always been a strength of Parallel C. A function library providing access to "raw" (unsynchronised, non-virtual) 'C40 links is therefore to be provided. This will be of particular benefit to pipelined DSP applications, allowing peak link performance to be reached in critical inner loops. The user retains control of the tradeoff between the convenience, well-understood semantics and compatibility of synchronised transputer-style communication, and the additional performance which going down to the native hardware mechanisms can bring. Since at present Parallel C only supports nearest-neighbour communications, those parts of VCR/UPR which handle message routing between nodes which are not direct neighbours have been eliminated from the 'C40 version. 
This simplified the porting process and also minimised the amount of code present on each node. However, it should be clear that adoption of the VCR/UPR framework provides considerable scope for future enhancement of Parallel C in this area of general message routing, both on the 'C40 and the transputer.

9.5 The Configurers

Three main issues arose with the configurers. The first concerned the machine on which the configurer should run. The transputer implementations run on a transputer board, using a host file server for I/O just like normal user programs. However, the compiler and linker from TI run as 80x86 programs under DOS, not as 'C40 code communicating with a file server. Both approaches have their own advantages and disadvantages: the server-based approach makes porting to different hosts easy; running directly on the host can sometimes be faster because file I/O does not have to be performed remotely. What was certain however, was that a mixture would only have the worst of both worlds. Since the TI compiler and linker were not server-based, the choice was made for us: we would port the configurers from the transputer to the 80x86. In fact, even although the C code had originally been written for a 32-bit machine with a flat address space (the transputer), it proved to be sufficiently well structured that a 16-bit version for the 286/386 was produced quite swiftly.

The second configurer issue arose from the nature of UPR, and the environment its original designers had assumed. UPR is driven by externally-generated routing tables. In the Southampton implementation, these tables were calculated "once and for all" for each hardware topology of interest.
However, because in the 3L implementation the user's configuration file could specify for performance reasons that particular hardware links should be dedicated to "raw", non-virtual traffic, and these links would consequently be unavailable for use by UPR in routing messages, the routing tables could not be set up before configuration time. Evidently, UPR routing table generation would have to be deferred until configuration time. Not only did this require splicing code from the table-generation programs into the heart of the configurer, it also raised the issue of the speed and memory requirements of the routing table generation algorithms. In the event, some judicious tuning sufficed to make routing table generation during configuration feasible.

The third main issue requiring modifications to be made to the configurer concerned the differences between handling the .B4-format task image files generated by the 3L transputer linker and the COFF-format images produced by the TI linker. .B4 files are position-independent, so the transputer configurer has little work to do in arranging for arbitrary groups of tasks to be loaded onto a single node. COFF objects are position-dependent, but contain the relocation information required to reposition them. The required on-the-fly relocation of COFF task images had to be incorporated into the 'C40 configurers.

9.6 The Host Interface

Parallel C applications for the 'C40 expect disk I/O and similar services to be provided by a server program running on a host computer. The Parallel C run-time library communicates with the server by some message protocol over one of the links of the "root" processor in the network.
In the transputer world, in spite of the diversity of available host hardware, from PCs and Amigas to VAXen and IBM 3090 mainframes, the physical interface seen from the transputer end of these host communications was always the same: the root transputer would simply send and receive server protocol messages over the link from which it had itself been bootstrapped. This meant that the same transputer code would work with any host system supporting the server protocol. A superfluity of rather ill-thought-out, ad-hoc server protocols confused the picture, but that is a separate issue.

Even the host end had a fairly easy time, especially if the server was running on an IBM-compatible PC, since the early availability and wide take-up of Inmos' original B004 development system had led to that board's ISA-bus hardware interface design becoming a de facto standard for other PC board vendors. This led to the happy situation where it was possible to take PC-oriented transputer software and run it "out of the box" on boards from a wide variety of different manufacturers. The B004 design's limited host I/O bandwidth, of the order of 100KB/s, eventually proved overly restrictive, leading to a proliferation of incompatible "go-faster" interfaces. Nevertheless, the basic B004 interface was usually retained as a lowest common denominator for software portability. The later B008 interface also achieved wide currency due to Inmos' decision to be a major player in the market for board-level transputer systems.

The evolution of 'C40 board interfaces already looks set to proceed in a different direction, largely because no "definitive" hardware design has become available early enough to set a trend.
TI has successfully sponsored the "TIM-40" standard [6] for plug-in 'C40 "daughterboard" modules corresponding in scope to Inmos' TRAM standard, but this clearly could not set a standard host/motherboard interface for the wide variety of host buses which must be supported. In practice all kinds of host interfaces to TIM-40 motherboards are appearing, ranging from bit-serial JTAG booting through message communication over one (or two) of the links on the root 'C40, right up to shared dual-ported RAM bypassing the links completely. A number of non-TIM standard boards are also becoming available.

3L's obvious goal was to support all these diverse interfaces while minimising the software effort involved. The software effort is not confined to the server and so could not easily be left to OEMs and third-party resellers to handle. Clearly, the root 'C40 must "know" how to communicate with the host: should messages be sent over a link (which one), or written into a shared memory buffer (where, and in what format)? This knowledge could either be "built-in" to a multitude of variant hardware-specific versions of 'C40 Parallel C, or we could take a leaf out of the PC software industry's book (where diversity of graphics "standards" presents a similar problem) and use hardware-specific software drivers for different boards. These drivers can in principle be created independently of 3L by hardware manufacturers, and would allow Parallel C to offer timely support for new hardware developments.

A software driver approach has in fact been adopted: all communication from the root to the host is controlled by the driver, which acts as a plug-in extension to the kernel. Note that for users to be able to run their Parallel C programs "out of the box" on different hardware, a particular driver cannot be bound into the application image file. Instead, the required driver is loaded at run time from the host-dependent server into the root 'C40, along with the kernel and user code.

10. Meeting the Management Challenge

Having seen the technical problems which had to be addressed in porting Parallel C to the 'C40, we shall round off this paper by looking briefly at how the software engineering management problem of successfully tackling a project of this scope with limited resources and a fixed deadline was addressed. The human resources available consisted of up to four full-time developers plus two test/QA staff over a planned project timescale of six months from the start of active work to shipment of initial software. The deadline had been self-imposed by the company's decision to announce publicly, from the time the port was agreed with Texas Instruments in August 1991, that the product was going to be launched with due fanfare at CeBIT in March 1992 (part of the Hannover Fair). In fact this deadline, by giving everyone involved in the project a "red line" to work towards, proved to be a blessing, concentrating minds and effort.

Although very little contingency "slack" was available in the plan, an early start on those parts of the project which were independent of the hardware, notably porting of the configurers from the transputer to the 80x86 and learning about the internals of VCR/UPR, allowed us to absorb what turned out to be a considerable slippage in the planned availability of the board-level hardware. Making as much use as possible of the instruction-level 'C40 simulators which were available from TI also helped.

With limited forces, the disposition of the troops becomes of paramount importance. Early agreement on the main goals of the project was the greatest help in clear decision-making and efficient execution: the product was to be robust from day one; it was to be ready by March 1992. These goals and limited resources meant putting into practice the prevailing gospel of software reuse. We simply could not afford the reinvention of any wheels, large or small.
This constraint led directly to the decisions to incorporate TI's existing compiler and linker into the package, and to adopt a proven message passing mechanism (VCR/UPR) developed in academia against the strong and understandable temptation to "roll our own". Both of these decisions gave considerable extra "leverage" to the work of our team by basing our work on the efforts of large groups elsewhere who had already devoted many man-years of effort to these topics. 3L is particularly proud to have been involved in the commercialisation of some of the products of EC-sponsored academic research (the VCR/UPR package developed as part of the PUMA programme of research into parallel architectures). A rare feat for a small UK company!

11. Remaining Problems

Most Parallel C code for the transputer which does not perform unnecessary low-level manipulation of the hardware (fiddling with process queues or writing directly to hard link channel addresses) can run happily on the 'C40. The main stumbling blocks are inherent in the 'C40 architecture: some programs will be unable to cope with an environment where bytes are word-sized, and where floating point double precision is no wider than single precision. Most applications in the DSP and real-time embedded system area targeted by Texas Instruments for the 'C40 should not find this too much of a problem.

12. Conclusion

The 'C40 is well-suited to running DSP, real-time and desktop supercomputer applications written in 3L Parallel C an order of magnitude faster than the previous transputer implementation, although some applications which require intensive character manipulation or double-precision floating point may be unsuited to the 'C40 architecture. Programs written for the 'C40 in Parallel C retain the advantages of clarity and reliability which Parallel C inherits from its CSP/transputer roots.
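The word-sized bytes mentioned in section 11 mainly trouble code that assumes one char holds exactly one 8-bit octet. As a minimal sketch of how such code can be written to survive on both kinds of machine, the fragment below derives everything from CHAR_BIT. The function names chars_for_octets, get_octet and put_octet are hypothetical, the packing order (lowest octet in the low bits) is an arbitrary choice made for illustration, and the sketch assumes CHAR_BIT is a multiple of 8; none of this is taken from the Parallel C library.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Octets that fit in one char: 1 where char is 8 bits wide,
 * 4 where char is a 32-bit word. */
#define OCTETS_PER_CHAR ((size_t)(CHAR_BIT / 8))

/* Number of char cells needed to store n_octets packed octets. */
static size_t chars_for_octets(size_t n_octets)
{
    assert(CHAR_BIT % 8 == 0);   /* assumption of this sketch */
    return (n_octets + OCTETS_PER_CHAR - 1) / OCTETS_PER_CHAR;
}

/* Read octet number i from a packed buffer. */
static unsigned get_octet(const unsigned char *buf, size_t i)
{
    unsigned shift = (unsigned)(i % OCTETS_PER_CHAR) * 8u;
    return ((unsigned)buf[i / OCTETS_PER_CHAR] >> shift) & 0xFFu;
}

/* Write octet number i in a packed buffer, leaving its neighbours alone. */
static void put_octet(unsigned char *buf, size_t i, unsigned value)
{
    size_t   cell_index = i / OCTETS_PER_CHAR;
    unsigned shift = (unsigned)(i % OCTETS_PER_CHAR) * 8u;
    unsigned cell = buf[cell_index];

    cell = (cell & ~(0xFFu << shift)) | ((value & 0xFFu) << shift);
    buf[cell_index] = (unsigned char)cell;
}
```

Where char is 8 bits wide the shift is always zero and the routines reduce to plain array indexing; where char is a full word they pack several octets into each cell. Code that instead hard-wires sizeof-based byte counts or casts between char and int pointers to reach individual octets is the kind most likely to need attention when moved.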
Users who have invested in learning occam/CSP-style techniques of parallel programming can immediately apply their experience to this new generation of hardware. In addition, users of previous generations of TI DSPs gain access to this powerful body of technique for reliable parallel systems development which is of proven utility.

In undertaking this porting exercise, a real commitment to software reuse has helped us to meet our demanding timescale. The demonstration of 3L's commitment to migration of Parallel C and its popular Application Program Interface to new state-of-the-art parallel processors also frees users from undue dependence on a single source for parallel hardware, which is bound to have a positive effect on the availability of parallel applications software.

References

[1] TMS320C4x User's Guide. Texas Instruments. 2564090-9721 revision A. May 1991.
[2] A.D. Culloch, Parallel Programming Toolkit for 3L-C, Fortran and Pascal. In: Jon Kerridge (Ed.), Proceedings of the 8th Technical Meeting of the Occam User Group. March 1988.
[3] Parallel C User Guide. 3L Ltd., 1991.
[4] P.S. Robertson, Intermediate Codes for Optimising Compilers. Ph.D. Thesis. Edinburgh University, 1979.
[5] M. Debbage, M.B. Hill and D.A. Nicole. Virtual Channel Router version 2.0 user guide. Technical Report, University of Southampton, June 1991.
[6] "TIM-40" TMS320C4x Module Specification. Draft 0.105, January 1992. Texas Instruments, Manton Lane, Bedford MK41 7PA, England.

- --
George Vokalek, Dept. of Mechanical Engineering, University of Adelaide, GPO BOX 498, gvokalek@augean.eleceng.ua.oz.au Adelaide 5001, South Australia Phone 61-8-228-4704, Fax 61-8-224-0464

------- Received message ends ----