Special Purpose Volume Rendering Hardware

The high computational cost of direct volume rendering makes it difficult for sequential implementations and general-purpose computers to deliver the targeted level of performance. This situation is aggravated by the continuing trend toward ever higher resolution datasets. For example, rendering a dataset of 1024³ 16-bit voxels at 30 frames per second requires 2 Gbytes of storage, a memory transfer rate of 60 Gbytes per second, and approximately 300 billion instructions per second, assuming merely 10 instructions per voxel per projection. To address this challenge, researchers have tried to achieve interactive display rates on supercomputers and massively parallel architectures [45,59-61,75,90]. However, most algorithms require very little repeated computation on each voxel, so data movement accounts for a significant portion of the overall cost. Today's commercial supercomputer memory systems do not have, and will not have in the near future, adequate latency and bandwidth for efficiently transferring the required large amounts of data. Furthermore, supercomputers seldom contain frame buffers and, because of their high cost, are frequently shared by many users.
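As a rough check on the figures quoted above, the following Python sketch recomputes them from the stated assumptions (1024³ voxels, 2 bytes per voxel, 30 frames per second, 10 instructions per voxel per projection). The function name and parameters are illustrative only, and the whole volume is assumed to be read once per frame.

```python
def rendering_budget(n=1024, bytes_per_voxel=2, fps=30, instr_per_voxel=10):
    """Return (storage in bytes, bandwidth in bytes/s, instructions/s).

    Assumes every voxel is fetched and processed exactly once per frame.
    """
    voxels = n ** 3
    storage_bytes = voxels * bytes_per_voxel
    bandwidth_bytes_per_s = storage_bytes * fps
    instructions_per_s = voxels * instr_per_voxel * fps
    return storage_bytes, bandwidth_bytes_per_s, instructions_per_s


storage, bandwidth, instructions = rendering_budget()
print(storage / 2**30)               # 2.0   (Gbytes of storage)
print(bandwidth / 2**30)             # 60.0  (Gbytes per second)
print(round(instructions / 1e9, 1))  # 322.1 (billion instructions per second)
```

The instruction count comes out slightly above 300 billion because 1024³ is a little more than 10⁹ voxels.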

In the same way that the special requirements of traditional computer graphics led to high-performance graphics engines, volume visualization naturally lends itself to special-purpose volume renderers that separate real-time image generation from general-purpose processing. This allows for stand-alone visualization environments that help scientists interactively view their data on a single-user workstation, either augmented by a volume rendering accelerator or connected to a dedicated visualization server. Furthermore, a volume rendering engine integrated in a graphics workstation is a natural extension of raster-based systems into 3D volume visualization.

Several researchers have proposed special-purpose volume rendering architectures [30, Chapter 6] [14,25,28,44,49,67,68,83]. Most recent research focuses on accelerators for ray-casting of regular datasets. Ray-casting offers room for algorithmic improvements while still allowing for high image quality. Recent architectures [22] include VOGUE, VIRIM, and most significantly Cube. VOGUE [34], a modular add-on accelerator, is estimated to achieve 2.5 frames per second for 256³ datasets. For each pixel a ray is defined by the host computer and sent to the accelerator. The VOGUE module autonomously processes the complete ray, consisting of evenly spaced resampling locations, and returns the final pixel color of that ray to the host. Several VOGUE modules can be combined to yield higher performance implementations. For example, to achieve 20 projections per second of 512³ datasets requires 64 boards and a 5.2 GB per second ring-connected cubic network. VIRIM [17] is a flexible and programmable ray-casting engine. The hardware consists of two separate units, the first being responsible for 3D resampling of the volume using lookup tables to implement different interpolation schemes. The second unit performs the ray-casting through the resampled dataset according to user-programmable lighting and viewing parameters. The underlying ray-casting model allows for arbitrary parallel and perspective projections and shadows. An existing hardware implementation for the visualization of 256 x 256 x 128 datasets at 10 frames per second requires 16 processing boards.

The Cube project aims toward the realization of high-performance volume rendering systems for large datasets and has pioneered several hardware architectures. Cube-1, a first-generation hardware prototype, was based on a specially interleaved memory organization [29], which has also been used in all subsequent generations of the Cube architecture. This interleaving of the n³ voxels enables conflict-free access to any ray of n voxels parallel to a main axis. A fully operational printed circuit board implementation of Cube-1 is capable of generating orthographic projections of 16³ datasets from a finite number of predetermined directions in real time. Cube-2 was a single-chip VLSI implementation of this prototype [3].
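The conflict-free property of such an interleaved memory can be illustrated with a small sketch. The text above does not give the mapping itself; the sketch assumes a simple linear skewing in which voxel (x, y, z) is stored in memory module (x + y + z) mod n, and the function names are hypothetical. Under that assumption, the n voxels of any axis-parallel ray fall into n distinct modules and can be fetched in a single parallel access.

```python
def module_of(x, y, z, n):
    """Memory module holding voxel (x, y, z), assuming linear skewing."""
    return (x + y + z) % n


def beam_is_conflict_free(n, axis, a, b):
    """Check that the axis-parallel ray fixed by (a, b) hits n distinct modules.

    axis selects the coordinate that varies along the ray (0=x, 1=y, 2=z);
    a and b are the two coordinates that stay fixed.
    """
    modules = set()
    for t in range(n):
        coord = [a, b]
        coord.insert(axis, t)  # place the varying coordinate on its axis
        modules.add(module_of(*coord, n))
    return len(modules) == n


n = 16  # resolution of the Cube-1 prototype mentioned above
print(all(beam_is_conflict_free(n, axis, a, b)
          for axis in range(3)
          for a in range(n)
          for b in range(n)))  # True
```

The check succeeds because moving one step along any axis advances the module index by exactly one, so n consecutive steps visit every module once.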

To achieve higher performance and further reduce the critical memory access bottleneck, Cube-3 introduced several new concepts [51,52,54]. A high-speed global communication network aligns and distributes voxels from the memory to several parallel processing units, and a circular cross-linked binary tree of voxel combination units composites all samples into the final pixel color. Estimated performance for arbitrary parallel and perspective projections is 30 frames per second for 512³ datasets. Cube-4 [26,53,55] has only simple and local interconnections, thereby allowing for easy scalability of performance. Instead of processing individual rays, Cube-4 manipulates a volume slice at a time. As a result, the rendering pipeline is directly connected to the memory. Accumulating compositors replace the binary compositing tree. A pixel bus collects and aligns pixel output from the compositors. Cube-4 is easily scalable to a very high resolution of 1024³ 16-bit voxels and to true real-time performance of 30 frames per second.
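The idea of slice-at-a-time processing with accumulating compositors can be sketched in software as follows. This is only an illustration, not the Cube-4 pipeline: it assumes an axis-aligned view so that each slice maps directly onto the image, it omits resampling and shading, and it uses the standard front-to-back "over" operator. The function name and the (color, opacity) sample layout are assumptions.

```python
def composite_slices(slices):
    """Composite slices front to back, one accumulator per pixel.

    slices: list of 2D grids ordered front to back; each cell is a
    (color, opacity) pair with both values in [0, 1].
    Returns the accumulated color and opacity images.
    """
    rows, cols = len(slices[0]), len(slices[0][0])
    acc_color = [[0.0] * cols for _ in range(rows)]
    acc_alpha = [[0.0] * cols for _ in range(rows)]
    for sl in slices:                 # one volume slice at a time
        for r in range(rows):
            for c in range(cols):
                color, alpha = sl[r][c]
                remaining = 1.0 - acc_alpha[r][c]  # light not yet absorbed
                acc_color[r][c] += remaining * alpha * color
                acc_alpha[r][c] += remaining * alpha
    return acc_color, acc_alpha


front = [[(1.0, 0.5)]]   # half-transparent bright sample
back = [[(0.5, 1.0)]]    # opaque darker sample behind it
color, alpha = composite_slices([front, back])
print(color[0][0], alpha[0][0])  # 0.75 1.0
```

Each pixel needs only its own running color and opacity, which is why a per-pixel accumulator can stand in for a tree that combines all samples of a ray at once.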

Enhancing the Cube-4 architecture, Mitsubishi Electric derived EM-Cube (Enhanced Memory Cube-4). A system based on EM-Cube consists of a PCI card with four volume rendering chips, four 64 Mbit SDRAMs to hold the volume data, and four SRAMs to capture the rendered image [50]. The primary innovation of EM-Cube is the block-skewed memory, where volume memory is organized in subcubes (blocks) in such a way that all voxels of a block are stored linearly in the same DRAM page. EM-Cube has been further developed into a commercial product built around a volume rendering chip, the vg500, developed by Mitsubishi. It computes 500 million interpolated, Phong-illuminated, composited samples per second. The vg500 is the heart of a VolumePro PC card consisting of one vg500 and configurable standard SDRAM memory architectures. The first generation, available in 1999, supports rendering of a rectangular dataset of up to 256 x 256 x 256 12-bit voxels in real time at 30 frames per second [56].
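The blocking part of this memory organization can be illustrated with a minimal address-mapping sketch. The block edge length of 8 voxels, the one-block-per-page layout, and the function names are assumptions made for illustration, and the skewing of blocks across memory modules is not modeled. The point is only that all voxels of a subcube receive consecutive addresses and therefore share a page.

```python
def blocked_address(x, y, z, n, b=8):
    """Linear address of voxel (x, y, z) in an n^3 volume stored in b^3 blocks.

    Assumes n is a multiple of b. Voxels of one block get consecutive
    addresses, so a block occupies one contiguous run of b^3 entries.
    """
    blocks_per_axis = n // b
    bx, by, bz = x // b, y // b, z // b   # which block
    ox, oy, oz = x % b, y % b, z % b      # position inside the block
    block_index = (bz * blocks_per_axis + by) * blocks_per_axis + bx
    offset = (oz * b + oy) * b + ox
    return block_index * b**3 + offset


def page_of(address, b=8):
    """Page number, assuming one page holds exactly one b^3 block."""
    return address // b**3


# All 512 voxels of the block at the origin land in the same page.
pages = {page_of(blocked_address(x, y, z, n=16))
         for x in range(8) for y in range(8) for z in range(8)}
print(pages)  # {0}
```

With a plain row-major layout, by contrast, the neighbors of a voxel in y and z lie n and n² entries away, so a small neighborhood can touch many pages.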

At the same time, Japan Radio Co. enhanced Cube-4 and developed a special-purpose architecture, U-Cube, which is specifically designed for real-time volume rendering of 3D ultrasound data.

The choice of whether one adopts a general- or special-purpose solution to volume rendering depends upon the circumstances. If maximum flexibility is required, general-purpose appears to be the best way to proceed. However, an important feature of graphics accelerators is that they are integrated into a much larger environment where software can shape the form of input and output data, thereby providing the additional flexibility that is needed. A good example is the relationship between the needs of conventional computer graphics and special-purpose graphics hardware. Nobody would dispute the necessity for polygon graphics acceleration despite its obvious limitations. The exact same argument can be made for special-purpose volume rendering architectures.
