Apple’s Power Mac G5 computer has significantly increased speed, capacity and performance capabilities, giving you new opportunities to create powerful, innovative software. But to take full advantage of the G5 platform, you’ll need to understand how the G5 differs from previous processors, which new tools Apple provides to help you analyze your software for the G5, and how to optimize your application.

Optimization is not just for high-end math and science applications; every application can benefit significantly from modifications that range from very easy to very complex. You will have to decide for yourself how much to optimize. This article can help by guiding you through the decision-making process and by explaining what’s involved in each set of options. If you are creating new applications, optimizing can help you get the most out of the G5 from the start; for existing applications, you can learn how to identify the parts of your application that can be tuned to run optimally on the G5. If you have already optimized your current products for the G4 processor, you’ll need to make a similar effort to optimize them for the Power Mac G5.

Apple provides the tools you will need for your optimization efforts, and these are described below. The sections that follow summarize the various levels of optimization that are available, so you can choose the optimizations that are most appropriate for you, according to your resources. But first, this article provides a quick summary of the Power Mac G5 platform and how it differs from any other personal computer.

The Power of the Power Mac G5
Notice that this article does not focus on the G5 as a processor only, but rather on the entire Power Mac G5 system. Previous efforts at optimization strategy focused primarily on exploiting processor features to maximum advantage. But the scope of optimization work for the Power Mac G5 is broader, because there’s more to the Power Mac G5 than just the G5 processor.
Everything in the Power Mac G5 computer contributes to its overall excellence. In fact, you shouldn’t think of the Power Mac G5 as just a processor. As a developer, you should think of it as a platform. Power Mac G5 computers aren’t just fast, they’re capacious: they have much higher capacities for computation, memory, disk storage, and data transfer. Developers who understand and fully exploit this will be able to design creative new Power Mac G5-based applications that, until now, would have been impossible on any personal computer.

Where does the power of the Power Mac G5 platform reside? A quick tour of the architecture will help you to understand the new approach to optimization and what it means for you.

64-Bit G5 Processor
The G5 processor isn’t simply a faster version of the G4. Instead, it is a redesigned processor that implements the 64-bit architecture built into the original PowerPC when it was designed by the AIM Consortium almost a decade ago. The G5 processor draws on IBM’s considerable experience in processor design. In fact, it is based on the execution core of IBM’s POWER4 processor, which drives IBM’s high-end pSeries 690 servers.

A full description of how the G5 processor contributes to the superior performance and capacity of the Power Mac G5 platform would be an article in itself. The following characteristics of the G5 processor give some indication of its performance and capacity:
The initial Power Mac G5 product line offers models with 1.6 GHz, 1.8 GHz and Dual 2 GHz PowerPC G5 processors. Note that bus speeds and other components vary depending on the model.

A Faster, Smarter System Controller
The U3 system controller (which connects the G5 processor to the rest of the computer’s components) is a custom chip created using the same IBM 130-nanometer technology as the G5 processor itself. It supports point-to-point routing, which enables multiple subsystems to simultaneously exchange data with main memory without involving the G5 processor.

Faster Memory, and More of It
The U3 system controller also makes it possible for Power Mac G5 computers to use fast 400 MHz, 128-bit DDR (Double Data Rate) SDRAM. Power Mac G5 computers currently have either four or eight DIMM slots, which enables them to hold up to 8 GB of physical memory with today’s 1 GB DIMMs. As higher-capacity DIMMs become available, Power Mac G5 computers will be able to use them.

This significantly higher level of memory capacity will certainly be at the heart of at least some of the future breakthrough applications for the Power Mac G5. In today’s computers, extremely large data sets (for example, video and complex 3-D models) must reside on hard disks, which periodically forces data accesses into the millisecond range. When all the data can reside in physical memory, data accesses will always be in the nanosecond range. In addition, data residing in main memory is easier to manipulate than data stored on a hard disk.

1 GHz Frontside Bus
The Power Mac G5 computer connects the G5 processor to the system controller through a frontside bus that runs at up to 1 GHz. This enables a tremendous increase in data throughput. Dual-processor Power Mac G5 models include a separate frontside bus to each G5 processor; this gives them an extra speed advantage over dual-processor Intel computers, which force both processors to share a single bus.
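As a rough worked figure for the memory numbers quoted above, the peak theoretical memory bandwidth can be estimated from the 400 MHz rate and the 128-bit width. This assumes that the 400 MHz figure is the effective DDR transfer rate and ignores all protocol overhead, so treat it as an upper bound rather than a measured value. The second line simply compares the orders of magnitude of a one-millisecond disk access and a one-nanosecond memory access.

```latex
\[
400 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}}
\times \frac{128\ \text{bits}}{8\ \text{bits/byte}}
= 6.4 \times 10^{9}\ \tfrac{\text{bytes}}{\text{s}}
\approx 6.4\ \text{GB/s}
\]
\[
\frac{1\ \text{ms}}{1\ \text{ns}} = \frac{10^{-3}\ \text{s}}{10^{-9}\ \text{s}} = 10^{6}
\]
```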
Why You Should Optimize
With no changes whatsoever, most compiled software will run proportionately faster on Power Mac G5 computers simply because of the G5 processor’s higher clock speed and the computer’s higher data throughput and increased number of execution units. However, you can make your software run even faster by simply recompiling it. If you take further optimization steps (described later in this article), you may be able to get your software to run several times faster on a Power Mac G5 computer than it does on previous Power Mac computers.

Tools for Optimization
The software suite you absolutely must have is the Xcode Tools, the new development tools package from Apple. It includes everything from the integrated development environment where you write, build and debug your applications, to human interface design tools, to performance optimization and debugging tools.

The Xcode Tools include an updated compiler, gcc (GNU Compiler Collection) version 3.3, which Apple has augmented to work with Mac OS X and the G5 processor. The gcc 3.3 compiler includes a number of changes that are necessary to optimize code for the Power Mac G5 platform, including new compiler flags and much stricter adherence to the established language specifications than previous versions of gcc (see Technical Note TN2086: Tuning for the G5: A Practical Guide for details).

The performance analysis tools that come with Xcode fall into two categories: software-only, non-invasive tools, both command-line and graphical, that operate at the process level; and the Computer Hardware Understanding Development (CHUD) tools, which rely upon dedicated hardware features to operate. Here are the highlights of a couple of them.

For a starting point, investigate Sampler. An exploratory optimization tool, Sampler is a performance-measuring application that analyzes a program’s running behavior and its allocation of memory by stopping the program periodically to examine the function call stack.
Sampler displays the functions that were most frequently seen while sampling was taking place. This information can help you locate the functions and sections of your code that are consuming large chunks of CPU time, as well as functions in your application where excessive memory allocations are occurring. Sampler is one of the non-invasive tools that operate at the process level. Its features allow you to understand overall running behavior and application state over time, with no modification to your application code.

Sampler’s hardware-based relative in the CHUD set is called Shark. This is an extremely valuable tool that also does time-based sampling of the computer running your software, telling you where the computer spends its time. The difference is that Shark can delve deep into the details of function usage, because its measurements are linked directly to the hardware. It enables you to find the specific routines that will benefit the most from optimization. When Shark is used in conjunction with the rest of Xcode, it can also display your routines’ source code and highlight the individual lines of source code that are consuming the most processor time. In many cases, Shark will also suggest what you might try to increase performance. Shark can also display the assembly-language code associated with your source code and show you execution details (for example, instruction groupings and processor stalls) that you can use to make assembly-level optimizations.

Note: An earlier version of Shark was named Shikari. Shark has been substantially upgraded specifically for use with the Power Macintosh G5 platform.

That provides a starting point for optimization tools. You should become familiar with Sampler and Shark as well as all the other performance analysis tools so that you can incorporate them into your standard development, debugging and quality assurance process. Full documentation about using these tools is installed when you install Xcode: see
Some Important Guidelines
There are some general optimization-related guidelines that you should consider first when deciding which level of optimization applies to you.

Re-optimize for Power Mac G5
If you implemented processor-specific optimizations on your software in the past, you’ll need to implement a similar level of G5-specific optimization on your software to ensure comparable performance on all Power Mac G5 computers. This is necessary because the very same code changes that maximize performance on one processor may interact adversely with another processor.

Harness Velocity Engine with vecLib
If your software does any amount of vector, matrix, or signal processing, you should seriously consider rewriting the appropriate code to use Apple’s vecLib framework, which gives you access to several vector-processing libraries, including BLAS (Basic Linear Algebra Subprograms) and the vDSP digital signal processing library. Using vecLib multiplies the benefits resulting from your effort:
If you are willing and able to write your own code for Velocity Engine, see the Velocity Engine web page for more information.

Keep the Power Mac G5 Well-Fed
When optimizing, remember that Power Mac G5 computers are very hungry, very fast, and very sequential. This means that they consume very large amounts of data at one time, that they process it very quickly, and that nonsequential instructions and data accesses cause significant performance penalties. Many of the optimizations described below and in other Apple-supplied documentation cater to these characteristics. You should keep them in mind when you look for opportunities to optimize your software.

Be Wise When You Optimize
Optimization does not happen in a vacuum; it produces side effects that may affect your program in other, unacceptable ways. For example, a program optimized for speed alone may be too large on disk, or its larger size may cause additional disk accesses that negate the speed increase of the code itself. For these and other reasons, you’ll probably find it necessary to optimize different parts of your code separately. The Xcode Tools enable you to add compiler flags on a per-module or per-file basis, thus giving you the ability to control how the compiler optimizes your program. You’ll need to test different optimization combinations to determine which ones enable you to meet your performance goals.

Achieving the best possible performance for your program involves more than just optimizing it for the Power Mac G5 platform. You must also optimize the program itself, including such program optimizations as:
See the Performance section under For Further Information at the end of this article for resources to help you with this task.

Resist the urge to optimize the code that you intuitively “know” needs it. Profile your code for hot spots, and evaluate the effects of optimization not just in terms of time but in terms of the benefit that the user will perceive.

Optimization, Level by Level
Because every technical task exists within larger technical and business contexts, only you can decide which optimizations you should perform on your software. Your decision will include such factors as what your software does, your technical expertise, and what benefits you expect to see from the optimization process.

There are four levels of optimization for you to consider, starting with the easiest and working up to the most complex. In general, the lower the level, the more likely you are to implement the optimizations in that level. However, be aware that, even within the same level, different optimization tasks vary in difficulty, time to completion, and benefit/time ratio. For details on these and other optimizations, see Technical Note TN2086: Tuning for the G5: A Practical Guide and Technical Note TN2087: PowerPC G5 Performance Primer.

Level 0 Optimizations: Definitely
Recompile your software, using the -O3 flag to incorporate processor-independent optimizations.

Examine your code for opportunities to consolidate multiple operations on small amounts of data into one operation on one large amount of contiguous data. It may make sense to preload larger amounts of data from remote sources (for example, reading an entire file into memory in one operation rather than line by line) or to use larger buffers.

Converting data from one type to another (for example, from string to integer) is even more resource intensive on the G5 than on earlier processors. Look for opportunities to minimize the amount of type conversion that your software does.
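The Level 0 advice about preloading (reading an entire file into memory in one operation rather than line by line) can be sketched in C as follows. This is a minimal sketch and not code from Apple’s documentation: the helper names read_whole_file and count_lines are hypothetical, and the size query through fseek and ftell assumes a regular file on a POSIX-style system such as Mac OS X.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read an entire file into one heap buffer with a
   single fread call. Returns NULL on failure. On success, *out_len holds
   the byte count and the buffer is NUL-terminated for convenience.
   The caller owns the buffer and must free() it. */
char *read_whole_file(const char *path, size_t *out_len)
{
    assert(path != NULL && out_len != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Find the file size up front so the read is one large request.
       (Assumes a regular, seekable file.) */
    if (fseek(fp, 0L, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(buf); return NULL; }

    buf[got] = '\0';
    *out_len = got;
    return buf;
}

/* Process the data with one sequential pass over contiguous memory
   instead of issuing a separate I/O call for every line. */
size_t count_lines(const char *buf, size_t len)
{
    size_t lines = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            lines++;
    return lines;
}
```

A caller would load the file once, run count_lines (or any other pass) over the buffer, and then free it. Whether this beats line-by-line reading for a given program is something to confirm by profiling, since very large files may not be worth holding in memory in full.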
Type conversions that can be done without memory accesses are significantly faster.

Level 1 Optimizations: Probably
Recompile your software using the flags (as appropriate) that implement G5-specific optimizations.

Use Sampler to discover the routines where your software is spending most of its time (also known as “hot code” or “hot spots”). You may be able to improve this code’s performance by recompiling it using flags that unroll loops and replace subroutines with in-line code. Also look for opportunities to improve performance by rewriting the appropriate source code to be more efficient.

If your software makes significant use of the square-root function, a simple recompilation of your code using the appropriate flag will cause the compiler to invoke the G5 processor’s built-in square-root instruction. Depending on your situation, this simple, fast change may give your software a noticeable boost with virtually no effort involved.

Level 2 Optimizations: Possibly
Profile your running application using Shark and follow its recommendations for improving the performance of hot spots. Some of Shark’s recommended optimizations require some understanding of the G5 processor’s inner workings, but one of its recommendations, improving instruction alignment, is easy to understand and quick to implement.

The G5 processor is more sensitive than previous processors to misaligned instructions: its performance suffers when certain key addresses are not aligned to 32-byte block boundaries. You can get the compiler to automatically align functions, loops, jumps, and jump targets to 32-byte block boundaries by compiling individual files with the appropriate optimization flags (for example, -falign-loops=32). Because such optimizations increase the size of the resulting binary code, you must apply them sparingly and monitor their overall effect on the size of your software.
If you have previously optimized your software for the G3 or G4 processors, remember to take your original processor-independent code and optimize it for the G5 processor.

Level 3 Optimizations: For Maximum Effect
These optimizations provide the maximum performance increase, but they require significant technical knowledge about the behavior of the G5 processor. In most cases, they also involve a significant programming effort.

By writing the appropriate code, you can maximize your software’s use of the G5 processor’s dual 64-bit floating-point units (FPUs), dual 64-bit integer units, and dual load/store units.

For the appropriate numeric operations, you will achieve the highest possible performance by writing custom code for the built-in Velocity Engine. This can be a demanding process, and also a processor-dependent one that must be revisited for each G5 processor you plan on supporting. Remember that you can get most of the benefits of the Velocity Engine with only a small fraction of the effort by using the vecLib framework.

Packaging Your Optimizations
Code that has been optimized for the G5 by simple recompilation will run without penalty on a G4. If you have done more in-depth, G5-specific tuning (levels 1, 2 and 3) then you will in all likelihood want to provide a separate binary. In extreme cases, you may decide that you need only offer one version of your software that runs on Power Mac G5 computers only. However, you’ll probably want to support most or all of the Macintosh product line, which means that you need to decide how best to deliver the right code to each of your customers. There are several ways to achieve this; the first is:
It is possible for your software to query the computer on which it is running to see which processor-related features are available. You can design your software to isolate processor-dependent code and call the appropriate version as needed. This leads to two additional strategies for packaging your application:
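The query-and-dispatch design described in this section (isolate processor-dependent code and call the appropriate version as needed) can be sketched in C with a function pointer chosen once at startup. This is only an illustration under stated assumptions: every name here (scale_fn, scale_generic, scale_tuned, cpu_has_tuned_path, select_scale) is hypothetical, the probe is a stub that always reports the feature as absent, and a real program would replace it with whatever feature query the operating system provides. The “tuned” routine is a plain unrolled loop standing in for genuinely processor-specific code.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by every implementation of the routine. */
typedef void (*scale_fn)(float *data, size_t n, float factor);

/* Portable baseline that runs on any processor. */
void scale_generic(float *data, size_t n, float factor)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}

/* Stand-in for a processor-specific variant. Here it is only a loop
   unrolled by four; real tuned code would live in a separately
   compiled file built with processor-specific flags. */
void scale_tuned(float *data, size_t n, float factor)
{
    assert(data != NULL || n == 0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i]     *= factor;
        data[i + 1] *= factor;
        data[i + 2] *= factor;
        data[i + 3] *= factor;
    }
    for (; i < n; i++)          /* leftover elements */
        data[i] *= factor;
}

/* Hypothetical feature probe. This stub always answers "no"; a real
   one would ask the operating system about the running processor. */
int cpu_has_tuned_path(void)
{
    return 0;
}

/* Pick the implementation once; callers then go through the pointer. */
scale_fn select_scale(int has_tuned_path)
{
    return has_tuned_path ? scale_tuned : scale_generic;
}
```

At startup the program would call select_scale(cpu_has_tuned_path()) once and store the result, so the rest of the code never needs to know which variant is in use. Both variants must produce the same results, which is worth verifying in your test suite.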
Summary
This article is designed to get you started on optimizing for the G5; see the Technical Notes and other documents listed at the end of the article for more details.

The G5 platform is important not just because of the G5 processor but also because of the operating system and the hardware around it. Together, these components implement next-level increases in computing power and data capacity. These increases deliver new opportunities to you, the developer: opportunities for new applications and even new categories of applications. It’s essential that you carefully consider your goals and resources before you begin to modify or design your applications.

Although today’s applications, unchanged, will run faster on Power Mac G5 computers, you can add significant additional performance gains by optimizing your software for the Power Mac G5 platform. There are some optimizations that you should definitely implement on all your Macintosh software and others that you probably should implement. If you have performed processor-specific optimizations on your current software, you will need to implement G5-specific optimizations to make your software ready for the Power Mac G5 platform.

For Further Information
Overview Information
Optimization
Performance
Posted: 2005-04-29