64 is More in the 128 Debate

How Matrox's advanced 64-bit architecture competes against
"128-bit accelerators"

The Matrox Millennium and the Matrox Mystique are the fastest graphics controllers in the industry as measured by industry-standard tests such as Ziff-Davis' WinBench 96 and Winstone 96, as the results in Table 1 show. For end users, these benchmarks illustrate the real-world advantages of a graphics accelerator's performance.

Table 1 - WinBench 96 benchmark results. A larger number indicates higher performance.*

Board                      1024x768 @ 8 bpp   1152x882 @ 24 bpp    800x600 @ 24 bpp
Matrox Millennium          45.4               35.5                 35.2
Matrox Mystique            43.4               27.0 (1152 x 864)    33.2
#9 Imagine 128 Series 2    36.6               21.0                 28.8
STB Lightspeed 128         40.5               Not supported        29.5

Despite these benchmark results, some graphics board manufacturers have been comparing their 128-bit graphics accelerators against Matrox's MGA Series of 64-bit accelerators. The comparisons focus on the 128-bit technology and argue that the "wider" 128-bit datapath will automatically be faster. The technical explanations that follow show why this argument is a fallacy.

Numerous variables affect a graphics board's performance, all of which are discussed in this document. These include driver optimization, a well-designed register set, effective use of the PCI bus, a well-balanced graphics engine, and maximum use of the memory's bandwidth and special features. However, since a number of vendors are concentrating their efforts on a "number of bits" marketing campaign, Matrox will address this issue first, with specific examples of how the 64-bit designs of the Matrox Millennium and Matrox Mystique provide much larger bandwidth than all existing 128-bit designs.

Bandwidth - what the numbers mean
Vendors like STB promote their graphics boards as "128-bit," relying on users' belief that a bigger number automatically means a faster board. While some portions of the underlying technology of their product are 128-bit, other portions, such as the datapath to access memory, are only 32-bit. Labeling this architecture as 128-bit is therefore misleading, because only part of the architecture is 128 bits wide. By the same reasoning, it can be argued that the Matrox Millennium and Matrox Mystique are "512-bit" accelerators: for many graphics acceleration operations as basic as text drawing, pattern fills, clears, aligned bitblits, and block writes, the Matrox processors handle the data 512 bits at a time. The dedicated memory controllers of the Matrox Millennium and Matrox Mystique, the MGA-2064W and the MGA-1064SG processors respectively, access two memory chips at once, each with an internal 256-bit bus, providing the equivalent performance of a "512-bit" engine. Therefore, while Matrox characterizes its graphics controllers as "64-bit" controllers (because the datapath to memory is 64 bits wide), many fundamental graphics operations are actually handled 512 bits at a time by taking advantage of the frame buffer's specialized memory, WRAM or SGRAM. This results in raw performance up to four times faster than 128-bit controllers for operations like block writes and aligned blits, two fundamental graphics operations in Windows and multimedia applications.

A concrete example of 512-bit speed
There are many instances in which the graphics board operates at 512 bits when using WRAM or SGRAM. One specific example is the acceleration of basic text drawing operations using the dual-color block write capability found exclusively in WRAM and the single-color block write capability of SGRAM. In this example, the Matrox Millennium takes advantage of two features of WRAM to accelerate the text drawing operation. The first is the dual-color block write mode, which allows each pixel to be defined as a 1-bit value instead of an 8-bit value. The second is the 256-bit internal bus of the WRAM, which allows the memory to convert this 1-bit information back to 8-bit values for display, 512 bits at a time (since both memory chips operate at the same time).

WRAM's dual-color block write
Text in common GUI applications like Microsoft Word and Excel typically consists of black characters on a white background. This means each character block can be divided into pixels of one of two colors: the foreground color (in this case black) and the background color (in this case white).
Dual-Color block
When the Matrox Millennium receives the command to draw the letter "A", a simple conversion can be made by recognizing that, regardless of the color depth of the computer desktop, each pixel making up the letter "A" can be specified with only 1 bit of information. If the pixel to be written uses the foreground color it is given the value '1', and if it uses the background color it is given the value '0'. This capability to handle the information as two set values is unique to WRAM, which is used only by the Matrox Millennium.
When the desktop color depth is set to 8-bit, meaning each pixel has a value of 8 bits per pixel (bpp), a typical 64-bit engine can only calculate 8 pixels at one time (8 pixels x 8 bpp = 64 bits). Similarly, a 128-bit engine can calculate 16 pixels at one time (16 pixels x 8 bpp = 128 bits). Since Matrox's technology allows each pixel to be defined as a 1 bpp value, and the chip's datapath is 64 bits wide, it can calculate up to 64 pixels at one time (64 pixels x 1 bpp = 64 bits).
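The arithmetic above can be checked with a short Python sketch. The function name pixels_per_transfer is only an illustrative label, not part of any Matrox interface; it simply divides the datapath width by the number of bits used to describe each pixel.

```python
def pixels_per_transfer(datapath_bits: int, bits_per_pixel: int) -> int:
    """Whole pixels that fit in one transfer across the datapath."""
    return datapath_bits // bits_per_pixel


# Conventional engines move full 8-bit pixels across the datapath.
typical_64 = pixels_per_transfer(64, 8)
typical_128 = pixels_per_transfer(128, 8)

# With block write, each text pixel is described by a single bit.
block_write_64 = pixels_per_transfer(64, 1)

print(typical_64, typical_128, block_write_64)
```

The same 64-bit datapath carries eight times as many text pixels per transfer once each pixel is reduced to one bit.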

WRAM's 256-bit internal bus
The MGA-2064W engine of the Matrox Millennium sends the stream of 64 1-bit pixels to the WRAM, which can store both the foreground and the background colors. The memory then converts the information in real time, expanding each pixel to its full 8-bit color value so the text can be displayed. The 256-bit internal bus inside each memory chip allows it to handle 32 8-bit color expansions simultaneously (32 pixels x 8 bpp = 256 bits). Since the Millennium accesses two memory chips at the same time and sends 32 bits of information to each chip, a total of 64 8-bit text pixels can be written to memory in one clock cycle (64 pixels x 8 bpp = 512 bits). This results in performance four times faster than 128-bit accelerators, which as stated earlier can only output 16 pixels per clock cycle.
In the case of the Matrox Mystique, the Synchronous Graphics RAM (SGRAM) also has a 256-bit-wide internal bus, but it supports only a single-color block write mode. Compared with WRAM's dual-color block write, this means the information must be sent in two passes: once for the foreground color and a second time for the background color. The Mystique therefore effectively outputs 32 one-bit pixels (32 bits of information) per clock. This results in performance that is four times faster than other 64-bit accelerators and twice as fast as 128-bit accelerators.
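As a rough behavioural model of the two block write modes (not the actual hardware interface), the color expansion can be sketched in Python. The function names and the 8 bpp color values below are hypothetical and chosen only for illustration.

```python
def dual_color_block_write(mask, foreground, background):
    """One pass: a 1 bit expands to the foreground color, a 0 bit to the background."""
    return [foreground if bit else background for bit in mask]


def single_color_block_write(dest, mask, color):
    """One pass: write `color` where the mask bit is 1, leave other pixels untouched."""
    return [color if bit else old for bit, old in zip(mask, dest)]


# One 8-pixel row of a glyph, described with 1 bit per pixel (1 = foreground).
row_mask = [0, 0, 1, 1, 1, 1, 0, 0]
BLACK, WHITE = 0x00, 0xFF  # hypothetical 8 bpp color values

# WRAM-style: both colors are expanded in a single pass.
wram_row = dual_color_block_write(row_mask, BLACK, WHITE)

# SGRAM-style: a foreground pass, then a background pass with the inverted mask.
sgram_row = [None] * len(row_mask)
sgram_row = single_color_block_write(sgram_row, row_mask, BLACK)
sgram_row = single_color_block_write(sgram_row, [1 - b for b in row_mask], WHITE)

print(wram_row == sgram_row)
```

Both approaches produce the same pixels in memory. The single-color mode simply needs two passes to get there, which is why its effective rate is half that of the dual-color mode.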

Table 2 - Number of pixels of text written per clock at different color depths

Controller type                  8 bpp   16 bpp   32 bpp
Typical 64-bit controller        8       4        2
128-bit controller               16      8        4
Mystique 512-bit controller      32      16       8
Millennium 512-bit controller    64      32       16
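The figures in Table 2 follow from dividing an effective width by the color depth, as the Python sketch below shows. The 64-bit, 128-bit and 512-bit widths come from the text above; the effective 256 bits per clock used for the Mystique is an assumption made here to model the two-pass single-color block write, not a figure published by Matrox.

```python
# Effective bits of frame buffer written per clock during text block writes.
EFFECTIVE_BITS_PER_CLOCK = {
    "Typical 64-bit controller": 64,
    "128-bit controller": 128,
    "Mystique 512-bit controller": 256,    # assumption: 512 bits halved by two passes
    "Millennium 512-bit controller": 512,  # two WRAM chips x 256-bit internal bus
}
COLOR_DEPTHS = (8, 16, 32)


def text_pixels_per_clock(effective_bits: int, bpp: int) -> int:
    """Text pixels written per clock at the given color depth."""
    return effective_bits // bpp


table2 = {
    name: [text_pixels_per_clock(bits, bpp) for bpp in COLOR_DEPTHS]
    for name, bits in EFFECTIVE_BITS_PER_CLOCK.items()
}


def speedup(faster: str, slower: str, bpp: int = 8) -> float:
    """Ratio of text pixels per clock between two controller types."""
    return (text_pixels_per_clock(EFFECTIVE_BITS_PER_CLOCK[faster], bpp)
            / text_pixels_per_clock(EFFECTIVE_BITS_PER_CLOCK[slower], bpp))


for name, row in table2.items():
    print(f"{name:<32}" + "".join(f"{value:>6}" for value in row))
```

Calling speedup("Millennium 512-bit controller", "128-bit controller") gives the fourfold advantage quoted for the Millennium, and the same call for the Mystique gives the twofold advantage.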

Other factors affecting graphics performance
As mentioned before, the best way to achieve the maximum performance with a graphics board is to ensure the accelerator has a well balanced, fully optimized design. This includes driver optimization, a well-designed register set, effective use of the PCI bus, a well-balanced graphics engine, and maximum use of the memory's bandwidth and special features.

How optimized are the drivers?
Matrox is the leading company in software driver technology. Moreover, all of Matrox's graphics products are based on proprietary graphics engines. This gives Matrox's software engineers a closer relationship to the inner design of the MGA processors than would be possible if the processors on Matrox's graphics cards were supplied by another company. By contrast, most other companies purchase generic chipsets and are either limited to the supplied drivers or attempt to develop their own drivers based on the chip vendor's specifications.

How optimized is the register set?
Each operation performed by the graphics accelerator is triggered by a sequence of commands handled by the register set. The efficient organization of these command sequences determines how fast the operations can be executed. Drawing on 20 years of experience in the graphics industry and five generations of graphics controllers, Matrox's well-balanced design ensures maximum efficiency of data transfers to its accelerators through the register set.

How optimized is the use of the 32-bit PCI bus?
In some cases, the PCI bus can create a bottleneck for 128-bit and 64-bit graphics accelerators alike. For example, image loads can create a bottleneck because of the enormous amount of data that must be funneled to the graphics controller through the 32-bit PCI bus. It is therefore very important that the PCI controller make the best use of the "scarce" PCI bandwidth for data-intensive functions like image loads, video playback and 3D operations. Matrox optimizes performance with a finely tuned controller that avoids any stalls in information passing through the PCI bus (also referred to as "zero wait state" operation). Performance is further optimized by Matrox's maximal use of the PCI bus' bursting capabilities, where information is sent in "bursts," or groups, instead of small amounts at a time. Furthermore, with the MGA-1064SG processor, Matrox has developed full scatter-gather PCI bus mastering, which offloads PCI data transfers from the host CPU during 3D functions (see Anatomy of a 3D accelerator for more details).
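The effect of bursting and wait states can be illustrated with a simplified Python model. It assumes a 33 MHz bus clock and one overhead clock (the address phase) per transaction, and it ignores arbitration and turnaround cycles; these figures are assumptions for illustration and not measurements of any Matrox board.

```python
def pci_throughput_mb_s(burst_len: int, wait_states: int = 0,
                        overhead_clocks: int = 1, clock_mhz: float = 33.0,
                        bus_bytes: int = 4) -> float:
    """Approximate throughput (millions of bytes per second) of a 32-bit PCI bus.

    burst_len       -- data phases per transaction
    wait_states     -- extra clocks inserted in each data phase
    overhead_clocks -- clocks per transaction that carry no data (assumed: 1)
    """
    clocks_per_transaction = overhead_clocks + burst_len * (1 + wait_states)
    bytes_per_transaction = bus_bytes * burst_len
    return bytes_per_transaction * clock_mhz / clocks_per_transaction


single_word = pci_throughput_mb_s(burst_len=1)
long_burst = pci_throughput_mb_s(burst_len=16)
stalled_burst = pci_throughput_mb_s(burst_len=16, wait_states=1)

print(f"single word:         {single_word:6.1f}")
print(f"16-word burst:       {long_burst:6.1f}")
print(f"burst, 1 wait state: {stalled_burst:6.1f}")
```

Under these assumptions, a long zero-wait-state burst approaches the peak rate of the bus, while single-word transfers or a single wait state per data phase roughly halve it.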

How fast does the graphics engine process operations?
Simply having 128 bits of information available at a time does not necessarily mean all of the information is useful. In many graphics drawing operations, for example drawing vertical lines (very common in spreadsheet and word processing applications), memory accesses are scattered in the frame buffer such that only one pixel is drawn at each memory access. In this case, since only one pixel's worth of the 64 bits (in a 64-bit engine) or 128 bits (in a 128-bit engine) is written at each memory access, a well-balanced 64-bit design handles the data more effectively than a design whose datapath is merely oversized.
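A minimal Python sketch of this point, assuming one pixel is written per memory access as described above: the function name is illustrative, and the result is simply the fraction of the datapath that carries useful data.

```python
def scattered_write_utilization(datapath_bits: int, bpp: int) -> float:
    """Fraction of the datapath used when one pixel is written per access."""
    return bpp / datapath_bits


for width in (64, 128):
    used = scattered_write_utilization(width, bpp=8)
    print(f"{width}-bit datapath at 8 bpp: 1 pixel per access, {used:.2%} of the bus used")
```

Both engines draw one pixel per access, so the vertical line is drawn no faster on the wider datapath; it only leaves a larger share of the bus idle.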

How well does the graphics controller take advantage of special features in the memory subsystem?
Matrox has based its graphics engines on two powerful new memory types: WRAM and SGRAM. In addition to the high speed at which these memories can operate, they have many graphics-specific features that Matrox's well-balanced controllers exploit to deliver maximum performance. These features include dual-color block write for text and pattern fills, used in all Windows-based applications, as well as fast aligned bit blits and multiple-bank support for double buffering in 3D and video playback.

How are dead cycles minimized on the memory bus?
All memories have a certain theoretical bandwidth available. Although this maximum bandwidth cannot be reached in practice, the more times the graphics engine can access memory during a cycle, the more of that bandwidth it is actually using. The most important factor to consider is how few memory cycles are wasted by the graphics engine. Although having a wide aperture to memory can be beneficial, the engine needs to effectively handle memory accesses in order to take best advantage of the available bandwidth. Again, this is addressed in Matrox's memory controllers, which are designed to optimize their use of the large available bandwidth of WRAM and SGRAM.

Summary
The use of a numerical value such as 128-bit or 512-bit to describe a technology can mislead users into thinking that bigger is necessarily better. While some accelerators may use partial, or even complete, 128-bit technology, the true measure of effectiveness rests in taking full advantage of available resources. Instead of falling into a "number of bits" marketing strategy, Matrox has chosen to continue designing advanced, well-balanced graphics technology that is fine-tuned to offer the best performance in real-world applications at competitive prices. With its fourth- and fifth-generation graphics accelerators, the Millennium and Mystique, Matrox uses its powerful memory technology, PCI bus mastering and its own award-winning 64-bit graphics processor technology to generate unmatched acceleration, regardless of bandwidth.

* Testing conducted by Matrox Graphics Inc. on a Gateway PS-166 MHz with 256K cache and 1 MB of RAM. Matrox Millennium, Matrox Mystique and Number Nine Imagine 128 were configured with 4 MB of memory using driver revision 3.16, 3.16 and 2.09 respectively. STB Lightspeed 128 was configured with 2.25 MB of memory using driver version 1.22 M. STB could not be upgraded to support 1152x864 @ 24 bpp. Number Nine does not support 24 bit and therefore was tested at 32 bpp.


Copyright © 1996 Matrox Graphics Inc. All rights reserved.