On the Feasibility of a Scalable Opto-Electronic CRCW Shared Memory

Paul Lukowicz, Walter F. Tichy
Universität Karlsruhe, Fakultät für Informatik, Germany
email: lukowicz@ira.uka.de, tichy@ira.uka.de

Abstract

In the paper we discuss the results of a feasibility study of an opto-electronic shared memory with concurrent read, concurrent write capability. Unlike previous such work we consider a true hardware shared memory rather than a simulation on a tightly, optically connected distributed memory computer. We describe an architecture that can be implemented using semiconductor based light modulator and VCSEL laser diode arrays and microlenses. We show how to solve two major problems faced by such a device: optical system complexity and parallel word level write consistency. An analysis of the fundamental performance constraints proves that in general the memory density decreases only linearly with the degree of concurrency as opposed to the quadratic decrease in electronic parallel access memory. Under certain circumstances the memory density can even be shown to be independent of the number of accessing processors. We argue that in principle, a memory with GBytes capacity and a latency of about 1 ns, accessed by thousands of processors is conceivable.

1 Introduction

The concurrent read concurrent write shared memory (CRCW SM) is a powerful concept for parallel processing. It allows simple and efficient implementation of a variety of parallel algorithms. It is also an essential component of the PRAM model of parallel computation [14]. Thus, much effort has been invested in its practical implementation. Unfortunately, attempts to build a uniform access time shared memory scalable beyond a small number of processors (PEs) have so far yielded unsatisfactory results. Efforts now concentrate on simulating shared memory on communication networks, using latency hiding and other techniques [17, 37]. Apart from the additional cost caused by the network, this introduces network management delays and sequentialization on memory module level.

In this paper we consider an alternative approach: using the advantages provided by optical communication data storage to implement a scalable hardware shared memory. We utilize the fact that in 3D free space optical technology a network with a bisection width \( B \) can be implemented on an area proportional to \( B^2 \) [11]. On the other hand any electronic implementation with an arbitrary but fixed number of layers requires an area \( O(B^3) \). We present the results of an extended feasibility study to show that despite being technologically challenging opto-electronic CRCW-SM (OCRCW SM) is an interesting possibility that should become feasible in the near future.

We first give a short overview of related work in section 2. Then in section 3 the principle of the OCRCW SM is described. From there we proceed to describe the overall system architecture in 4 and discuss implementation issues in section 5. The problem of parallel write word level consistency is considered in section 6. Finally performance estimates are given in section 7 and our conclusion and future work are presented.

2 Related Work

2.1 Electronic Multiport Memory

Multiport memory is an extension of common RAM that allows truly parallel access by providing multiple ports, each with its own addressing mechanism. Currently multiport memory modules with up to 4 ports are available. A larger number of ports has so far
not been implemented due to the cost of duplicating the addressing mechanism for each additional port. A recent study [13] shows the chip area needed to implement a $P$ port memory module with a given capacity to be proportional to $P^2$. It concludes that no more than 10 to 16 ports can be expected to be feasible in the foreseeable future.

2.2 SM Simulation

A huge amount of research in the area of parallel processing is devoted to the simulation of SM on distributed memory computers. According to the level of simulation the approaches can be classified as software (e.g. [42, 4, 18]), operating system (e.g. [16, 27, 2]) or hardware (e.g. [1, 25, 39]) based. Although some promising results have been achieved, a system with a performance comparable with the ideal SM for a wide range of problems and a large number of processors is still not in sight.

2.3 Optics in Parallel computing

Optics has long been considered superior to electronics in communication intensive applications. The use of optics for computing was pioneered among others by Huang [20, 19] and Lohmann [28, 29, 33]. So far, research on the use of optical technology in parallel computers has concentrated on the improvement of interconnection networks [36, 15, 43, 3]. Concepts have also been proposed for direct optical connections between electrical [23] or optical [31] memory banks. It was suggested that a fully connected network can realize a model of computation that is closer to the PRAM then conventional parallel computers [10]. Our work goes a step further by proposing a direct implementation of the full functionality of a CRCW parallel random access memory. Although the potential has been pointed out repeatedly (e.g. [30], [6]), little research has so far been performed in that area. In [5] a volume holographic concurrent memory was proposed and a small demonstrator (5 processor and 5 5 bit pages) implemented. In [24] architecture concepts for a parallel optical associative cache are discussed.

3 Basics

3.1 Problem Overview

A sequential memory system with the capacity of $M$ bit consist of $M$ memory elements and a 1 to $M$ connection allowing the processor to access each memory cell. A $M$ bit $P$ Processor shared memory system requires $M$ memory elements and a $P$ to $M$ connection. This is true for both an ideal, hardware SM system and a network based simulation. The difference between the two approaches lies in the degree of parallelism allowed by both the network and the memory elements. In an ideal SM system there is no limit on the number of processor that can concurrently access any single memory cell. The $P$ to $M$ interconnection network is equivalent to a superposition of $P$ 1 to $M$ networks present in a sequential memory system. It provides an independent channel between every processor end every memory cell. Thus the delay of a memory access operation performed by any given processor does not depend on the activity of the other processors. The main difficulty in implementing an ideal shared memory is the necessity for an independent physical connection for every communication channel. The area needed for the physical realization of the connections accounts for the $P^2$ bound on the area of a $P$ port VLSI memory mentioned in section 2.1.

3.2 Principle of Operation

The idea behind an opto-electronic CRCW SM (OCRCW SM) is that information stored using optically controlled, variable light absorption of memory pixels can be accessed concurrently by distinct light beams. The optical addressing is performed in free space by directing a beam of light towards the desired memory cell. Thus no physical connections are required. Multiple communication channels can coexist in the same space. In principle the addressing mechanism of a $P$ processor $M$ bit shared memory can easily be constructed by superposition of $P$ sequential $M$ bit addressing devices.

Bit Storage

In opto-electronic memory bits are stored in miniature light modulators (LMs). These are pixel like, multi-stage devices that can modify some property of incident light in accordance with their current state. In the following we consider LMs with two states: light absorbing and light transmitting that can be selected by applying an appropriate voltage. A simple common LM of that kind is a liquid crystal cell that constitutes a single pixel of a LCD display.

The state of a LM corresponds to a single bit of information. It can be read out by illuminating the LM with a beam of light and using a light detector (D)
to check whether it has been absorbed or transmitted. By illuminating a LM with multiple light beams and using multiple detectors a concurrent read operation can be performed (Figure 1).

To allow optical bit write an electro-optic memory cell (MC) can be constructed by combining a voltage controlled LM with two photodetectors. When illuminated, one photodetectors (set detector, SD) causes the modulator to be switched on, while the other one (clear detector, CD) switches it off (Figure 2). Thus a bit can be written by illuminating either then SD or the CD of the corresponding MC.

For a parallel write operation with some processors trying to set the bit and others trying to clear it the SD and CD can be wired to follow the majority. The bit is set depending on whether more light is incident on the SD or on the CD.

Opto-Electronic RAM

A random access memory based on the bit storage mechanism described above can be constructed using following components:

1. an array of electro optic MCs (memory plane, MP),
2. a beam steering device for read access (read unit, RU),
3. a beam steering device for write access (write unit, WU),
4. a light detector (D), and
5. an optical system (OS) to direct the beams towards the detector.

The basic principle is shown in Figure 3. The address of the desired memory location adr is translated by the RU into a light beam directed towards the corresponding location in the memory plane. After hitting the memory plane the light beam is directed towards the detector by the OS. For a write operation the WU unit sends a light beam towards the set or clear detector of the MC corresponding to the selected address.

**OCRCW SM**

To allow parallel access the above setup must be extended by an additional RU, WU, and D for every accessing processor (Figure 4). Since different light beams do not interact with each other, every processor can access the memory as if no others were present. Any PE can perform a read or write access on any memory location at any time realizing the full CR CW SM semantics.

For P processors the system performs $P \times 1 \times M$ mappings followed by a $M \times P$ mapping for parallel readout and a $P \times 2M$ mappings for parallel write.

### 3.3 Design Consideration

The major concerns that have to be addressed by any OCRCW SM design are optical system complexity and parallel write consistency on word level.

#### 3.3.1 System Complexity

There are fundamental physical constraints on the amount of information that can be stored optically in a given area and the number of optical communication channels in a given volume. Furthermore practical consideration impose a limit on the size of a single memory module.

**Memory and Connection density** The memory density is limited by the minimum radius $r$ of a point that can be resolved using a given optical system. For a wavelength $\lambda$, and a lens (or any other equivalent optical system) with a diameter $d$ and a focal length $f$ it is given by

$$r = 1.22\frac{\lambda f}{d}$$

The maximum number of optical communication channels of a system depends on the degree of
space variance \(^1\). For general interconnects [41] shows that
\[
\nu_{\nu,v} = \left(\frac{a}{\lambda}\right)^2 \quad (2)
\]
space invariant channels can be accommodated in a volume \(V = a^3\) at the wavelength \(\lambda\). For space variant channels this number is reduce by a square root:
\[
\nu_{\nu,v} = \sqrt{\nu_{\nu,v}} = \frac{a}{\lambda}. \quad (3)
\]
In practice energy concerns and crosstalk due to aligned components impose even stricter bounds (e.g. [11, 7]).

**Size**: The physical size of a memory plane is limited by lens quality considerations, alignment and stability problems and volume utilization.

High quality lenses with low \(\frac{f}{D}\) ratio required by formula (1) are very difficult to manufacture for large \(D\). Furthermore high quality imaging is much easier for small than for large fields of view. Larger systems are also more difficult to align, package and stabilize. Due to (3,2) the memory to volume ratio (which is proportional to the number of connections)
\[
\frac{M}{V} \sim \nu_{\nu,v} \sim \frac{a^2}{a^3} = \frac{1}{a}
\]
gets worse as the memory plane diameter \(a\) grows.

The system not only gets more difficult to build but also increasingly consists of empty ‘useless’ space.

As a consequence of the above we must limit the number of individual points that can be addressed by each PE to reduce the degree of space variance of the system. Furthermore the restriction on the physical size of the system means that OCRSW-SM a must be constructed using small, constant size modules, in much the same way as electronic memory is organized in chips and banks. This poses a problem of designing a module interconnection network that does not compromise the parallel access capability. To achieve this goals a multistage, modular addressing mechanism is introduced in section 4.

\(^1\)The space variance of an imaging system defines the irregularity of the system. It can be described as the number of permutations between the object and the image. In a lens based system the degree of space variance determines the minimum number of lenses necessary to implement the system.

---

**Figure 3**: A scheme of a RAM memory using a plane of memory cells (MC) consisting of a light modulator (LM), and two detectors (SD and CD), light deflecting read and write units (RU and WU) and a detector (D) as described in 3.2.

**Figure 4**: A scheme of the OCR-W-SM using a plane (MP) of memory cells (MC) and an individual read unit (RU), write unit (WU) and detector (D) for each PE [see 3.3].

### 3.3.2 Parallel Word Level Write Consistency

A protocol which determines the value of a bit through a majority rule as described in 3.2 can lead to inconsistent values on word level. With several PEs writing a \(b_w\) bit word each one is likely to win only on a fraction of bits. An attempt to concurrently write \(w_1 = 0101\), \(w_2 = 0110\), and \(w_3 = 0000\) to the memory word \(w\) for example leads to
\[
w = 0100 \neq w_1, w_2, w_3.
\]

Thus in general the result of a parallel word write operation would be undefined. In contrast many shared memory algorithms assume that one of the accessing PEs wins on word level and manages to correctly write its value.

This problem is solved by a backoff synchronization strategy at memory module level. In section 6 we describe a fast backoff protocol that realizes a priori\(^k\) concurrent write model.
4 Overall Architecture

The reduction of the space variance of the system is achieved by restricting optical addressing to \( p \) pages, \( b = M/P \) bits each instead of allowing individual optical addressing of each of the \( M \) bits. The selection of individual bits from a page is performed electronically. Section 4.2 describes a page oriented memory system that does not restrict the level of concurrency.

To allow the division of the memory into constant size modules without sacrificing concurrency a tree like network is proposed that allows full parallel access of all processors to all memory modules. Section 4.2 discusses a network architecture that uses the same technology as the memory modules and only marginally increases the cost and complexity of the system (as compared with the same amount of memory implemented in a single module).

4.1 Page Addressed Memory Modules

In the page oriented OCRCW-SM memory is accessed in two stages: an optical page selection stage and an electrical bit selection stage. The beam steering for optical page selection is performed by miniature laser diodes (LDs) combined with appropriate optics. The electrical stage can be implemented with a fast matrix addressing mechanism.

Reading a Page Each PE must be equipped with an array of \( b \) detectors (DA) instead of a single detector. A memory page is read optically by projecting it on the DA. For that purpose each processor is equipped with separate miniature laser diode (LD) for each page. An appropriate optical system makes sure that the desired page is illuminated when the corresponding LD is activated and that the transmitted light is directed towards the DA of the reading processor (Figure 5). In the DA the required bits are selected by the matrix addressing mechanism.

Writing a Page To write a memory location an appropriately set mask page is projected on the memory page containing the desired location (Figure 6). The mask page has two electrically controlled LMs for each memory cell of a page. One LM corresponds to the SD, the other one to the CD. To write a given bit of a page first the LMs of the sample page corresponding to the other bits are all turned off. Those bits are unaffected by the write operation. Of the LMs corresponding to the selected bit either the set LM or the clear LM is switched on. This determines whether the bit is to be cleared or set (Figure 7). The above scheme guarantees that multiple PEs can write different bits of a single page independently.

4.2 Module Interconnection

The module interconnection network must allow parallel access of all PEs to all \( m \) modules. Thus a separate communication channel is needed between every processor and every module. This is equivalent to a superposition of \( m \) independent \( 1 \times m \) interconnects. For a large number of processors and memory modules the enormous number of communication lines means that only an optical implementation is feasible. Various high density optical networks proposed
in the literature could be modified for that purpose. In this section we propose an opto-electronic, Banyan like network that we will call a tree of multiplexers network. The advantages of this approach is that this type of network can be constructed with components similar in functionality and architecture to the memory modules. Furthermore the network structure is modular and can easily be scaled by adding modules.

**Tree of Multiplexers Network** The processors are connected to the modules by a tree of multiplexers as shown in Figure 8. The tree has the height $h$ and the degree $D$. At each node there is a multiplexer with one input and $D$ outputs for each PE. The memory modules are located at the leaves and have $P$ IO channels, one for each PE. Thus each PE has an independent connection to each memory module.

**Complexity** In addition to the $m = D^h$ memory modules the above architecture contains $n_{\text{mult}}$

$$n_{\text{mult}} = \sum_{i=0}^{h-1} D^i \leq 2 \cdot D^{h-1} \leq m$$

multiplexers. Each multiplexer performs $P \times D$ mappings directing the input of each PE onto one of $D$ alternative outputs. This is essentially a simplified version of the functionality contained in a single module that performs $P$ independent $1 \times p$ mappings with $p \gg D$.

**Cost** We assume that the cost of building a single multiplexer module is at most equal to the cost of a memory module. Dividing the OCRCW SM into modules in the above manner increases the number of components in the system by at most

$$\text{cost} \leq \frac{D^h + 2 \cdot D^{h-1}}{D^h} = \frac{D + 2}{D}.$$ 

The engineering effort does not increase since the additional components are simplified versions of the memory modules. The multiplexer stages have no influence on the number of switching operations needed for a memory access since they are part of the standard address decoding process.

5 Implementation Issues

In this section we briefly sketch possible optical implementations for the memory modules and the module interconnection network. We also discuss the availability and performance of the opto-electronic components required for the realization of the system. The purpose of this section is to provide a basis for the performance analysis and feasibility estimation. A detailed consideration of the best technical solutions is beyond the scope of the feasibility study and is the subject of current and future research.

5.1 Optical Memory Module System

For the sake of simplicity the following description will be restricted to a 2 dimensional system. The number of pages $p$ will be assumed to be equal to the number of processors $P$.

An exemplary read and write system for 3 processors and 3 pages is shown in figures 9 and 10. The imaging of the memory pages on the detectors in the read system and the mask pages on the memory pages in the write system are performed by an array of $2p-1$ imaging lenses $I_{1...2p-1}$ lenses each used at $2f$. The image of the memory produced by each lens in the detector plane is shifted vertically by a distance proportional to the vertical offset of the lens from the center of the memory plane. This can be used to make sure that for every $1 \leq i \leq p = P \geq j \geq 1$ there is an imaging lens $I_k(i,j)$ that images page $i$ on detector $j$ (or mask page $j$ on page $i$ in the write system). To illuminate a selected memory page (or the mask page) each processor has a beam steering mechanism that makes sure that all light passes only through the appropriate lens $I_k(i,j)$. The beam steering mechanism of every processor consists of an array of $p$ laser diodes (LDs) located off axis in front of a collimating lens. To guarantee the correct illumination angle in the read setup an additional pair of illuminating lenses $II$ separated $f^{\parallel}$ is necessary in front of each memory page.

5.1.1 Basic Lens Functionality

The system described below is composed convex lenses, each characterized by its diameter $d$ and focal length $f$. In the following paragraphs we will refer to the horizontal axis of the lens as the optical axis. We will also use a coordinate system (lens coordinates) with the origin located at the center of the lens. The planes perpendicular to the optical axis, located at the distance $f$ behind and in front of the lens are called the focal planes (back focal plane and front focal plane...
The read setup consists of an illumination and an imaging system. The former has \( P \) read units (\( RU_{1...P} \)) at \( x^{RU} \) followed by \( p \) illumination lenses \( l_{1...p} \) at \( x^{RI} \) in front of the MP at \( x^{MP} \). The MP has \( p \) pages \( p_{1...p} \) each \( d_{p} \) wide spaced by \( \delta_{p} \). The page spacing is chosen to be equal to the size of an individual page

\[
\delta_{p} = d_{p}
\]  

(10)

The imaging system behind the MP is composed of an array of \( 2p - 1 \) imaging lenses \( R I_{1...2p-1} \) at \( x^{I} \) and \( P \) detector arrays \( DA_{1...P} \) at \( x^{DA} \).

**Imaging System** The imaging lens array is located at

\[
x^{I} = x^{MP} + 2f^{I} = x^{DA} - 2f^{I}
\]  

(11)

between the MP and the DAs. Each lens has a diameter \( d_{l} = d_{p} \) equal to the diameter of a single page and is located either directly opposite a page or between two pages. The size \( d_{d} \) and spacing \( \delta_{d} \) of the DA directly corresponds to the MP with

\[
d_{d} = d_{p} = \delta_{d} = \delta_{p}.
\]  

(12)

Thus the vertical coordinates of the centers of the pages \( y_{i}^{p} \) the \( I \)'s \( y_{i}^{I} \) and the DAs \( y_{i}^{DA} \) are

\[
y_{i}^{p} = y_{i+1}^{p} = y_{i}^{DA} - d_{l} = y_{i}^{DA}
\]  

(13)

\[
y_{i}^{p} = y_{k}^{DA} - y_{k}^{I} = \frac{x^{MP} - x^{DA}}{x^{MP} - x^{DA}}
\]  

(14)

In the lens coordinate system of the lens \( I_{i} \) this leads to

\[
y_{i}^{p} = -y_{k}^{DA}
\]  

(15)

\[
k + j = 2i + 1
\]  

(16)

The above setup results in \( I_{(k+j+1)/2} \) imaging page \( p_{k} \) on the detector array \( DA_{j} \). Since there are twice as many lenses as pages or DAs equation 16 can be satisfied for any pair \( 1 \leq i, j \leq p = P \). Processor \( P_{j} \) can thus read the page \( p_{k} \) by illuminating it in such a way that all light has to pass through \( I_{(k+j+1)/2} \). This is accomplished by a parallel beam of light passing through the page towards the corresponding DA traveling at the angle \( \theta_{k,j} \)

\[
\tan \theta_{k,j} = \frac{y_{k}^{DA} - y_{k}^{I}}{x^{MP} - x^{DA}}
\]  

(17)

**Illumination System** For every processor \( j \) and every page \( k \) the task of the illumination system is to direct a parallel beam of light towards \( p_{k} \) in such a way that it satisfies condition (17). This is done in two stages. First a parallel beam of light is directed by the read unit \( RU_{j} \) towards the focal plane of \( I_{k} \) at the angle \( 180 - \theta_{k,j} \). According to 8 a light beam at the required angle can be generated by a laser diode \( L_{j,k} \) located at the focal plane of \( RU_{j} \) at \( y_{k}^{LD} \) with

\[
y_{k}^{LD} = \Delta_{j}^{y} + y_{k}^{I}
\]  

(18)

\[
\Delta_{j}^{y} = \tan (180 - \theta_{k,j}) \cdot f^{RU} = \frac{f^{RU} (y_{k}^{DA} - y_{k}^{I})}{x^{MP} - x^{DA}}
\]  

(19)
For the light beam to illuminate the focal plane of \( l_k \) at \((x^{II}, y^p_k)\)

\[
\frac{f^{RU}}{\Delta y_k} = \frac{y^p_k - y^{RU}_k}{(x^{II} - f^{II})} \tag{20}
\]

with

\[
y^{RU}_k = y^{II}_k \tag{21}
\]

must be satisfied. Assuming each illumination unit to be located opposite a memory page

\[
y^{II}_k = y^p_k = y^{DA}_k = y^{RU}_k \tag{22}
\]

together with (12) and (12) leads to the distance between the read units and the front focal plane of the illumination units to be equal to the distance between the memory plane and the imaging lenses.

\[
(x^{II} - f^{II}) = x^I - x^{MP} = x^{DA} - x^I = 2f^{II} \tag{23}
\]

Each illumination unit has two identical lenses with the focal length \( f^{II} \) separated by \( 2f^{II} \). The \( l\)s reverse the vertical direction of the incoming beam and direct it toward the memory page \( k \) located \( f^{II} \) behind the second lens at

\[
x^{MP} = x^{II} + 3f^{II} \tag{24}
\]

at the desired angle \( \theta_{k,j} \).

### 5.1.3 Writing a Page

In a sense the write process requires the reverse functionality of the read system. For read access each processor must project one of the many memory pages on

its own detector array. This can be viewed as a \( p \) to \( 1 \) mapping. In the write system on the other hand every processor must be able to project its particular mask page on any given page resulting in a \( 1 \) to \( p \) mapping.

The write setup can be derived from the read setup by replacing the detector arrays with mask pages and placing the horizontally flipped read units at \( f^{RU} \) behind them as shown on Figure 10. The combination of the mask page and the read unit constitutes a write unit \( WU \). This way the imaging system is operated 'in reverse'. Thus the same lens \( l_{(k+j-1)/2} \) that images memory page \( p_k \) on the detector array \( DA_j \) images the mask page \( MP_j \) on the memory page \( p_k \) (light paths
are reversible). For the light to reach read $I_{(k+j-1)/2}$ $MP_j$ must be illuminated at the same angle $\theta_{k,j}$ as $p_k$ during read access by processor $P_j$. According to (??) light directed towards page $p_k$ by $RU_j$ leaves the read unit at the angle $180 - \theta_{k,j}$. As a consequence the above requirement is automatically satisfied when horizontally flipped read units are placed at the distance $f_{RU}$ behind the corresponding mask pages.

5.2 Module Interconnection Network

This section shows how the intermodule network described in 4.2 can be implemented based on the technology and concepts used for the memory modules. It uses a 1 detector $D$ LDs smart pixels for address decoding and a shifted lens system similar to the page illumination setup of the memory modules for beam steering to the next stage. For the geometrical layout a 3D H-tree topology is chosen.

5.2.1 Multiplexer Modules

We assume that data from the processor to the memory is transferred in packets proceeded by a header containing the address. Each stage decodes part of the address and sends the data accompanied by the rest of the address to the next stage. To perform the decoding each stage has a combination of a detector, decoding circuits and $D$ LDs for every processor. The LDs are combined with an optical system that makes sure that each LD sends the data to the corresponding detector of a different next stage module.

Multiplexer Optical System The multiplexer optical system shown in Figure 11 is a variation of the setup for illuminating memory pages in the read process described in the previous section. It is based on the fact that laser diode $LD_{ij}$ used by $RU_i$ to illuminate $I_l$ is imaged by the first lens of $I_l$ at

$$y_{ij}^{LD} = y_l^R - \Delta_{ij}^{LD}$$

(25)

in the back focal plane of that lens. The multiplexer optical system thus consist of an array of $P$ addressing units $AU_{1...P}$ at $x^M$ followed by an array of $D$ focusing lenses $FL_{1...D}$ at $x^{FL}$ and an array of $D$ next stage receiver arrays $RA_{1...D}$ at

$$x^{RA} = x^{FL} + f^{FL}.$$ 

(26)

The AUs are identical with the RUs except for the number of LDs, which is $D$ instead of $p$. The distance between the AUs and the FUs is determined by a relation similar to (20):

$$\frac{f_{AU}}{\Delta y} = \frac{y_{k}^{FL} - y_{ij}^{AU}}{(x^{FL} - f^{FL})}$$

(27)

Taking $f_{AU} = f_{RU}$, $f_{L} = f^{FL}$ and $y_{i}^{AU} = y_{i}^{RU}$ makes the dimensions of the multiplexer modules and memory modules equal.

At each stage the data going from the memory modules to the processors arrives from one of the $D$ modules of the next stage and has to be send to the proceeding stage. This can be accomplished without any electronic address decoding by a purely optical setup using beam splitters.

5.2.2 Connecting the Multiplexer Modules

The objectives of the design of the geometrical layout of the network are to find a technically feasible way of implementing the huge number of intermodule connections and to realize the highest possible memory density. The proposed solution is based on a H-tree topology which is often used in VLSI-design (see Figure 12). It can be implemented in 2D for binary and in 3D for quad trees. At each node the connections to
the sons are rectangular to connection from the parent. The length of the connections between the nodes is reduced by half with every stage.

To realize a 3D quad tree each multiplexer module is a cube with the front face connected to the predecessor and the sides connected to the next stage modules. Using simple orthogonal mirrors the system described in the previous section can be transformed to fulfill this requirement. The level dependent length of connection lines can be implemented using additional connection modules consisting of a single lens with a focal length 1/4 of the cubicles length.

The above schemes guarantees a very good utilization of the available volume. Alignment is required between neighboring modules only. Since the density of communication channels is much smaller then in the memory modules no critical alignment problems arise.

### 5.3 Optoelectronic Components

Three types of opto-electronic components are required to build an optical CRCW memory proposed in this paper: light sources, light modulators and light detectors. Of interest are only devices that can be built and miniaturized in semiconductor technology and can potentially be operated at GHz frequencies at low driving energies. A comprehensive overview of such devices and their operating principles can be found in [40].

#### Lasers

Much work has been performed in the area of semiconductor laser arrays. Reference [22] reports a GaAs MQW laser array with a device density of 2x10⁶/cm² and 0.5mW per laser. The experimental device had an active area of 15mm² and was operated at 230MHz. This type of device was used as a read out element for holographic page memory [34]. Recently LD arrays of up to 256 elements each 15μm in diameter with over 3mW output and up to 1GHz modulation frequency became commercially available [35].

#### Light Modulators

Light modulators are the most critical opto-electronic components. Devices allowing MHz or GHz modulation frequencies at low energies became available only recently with the advance of the MQW (multiple quantum well) SEED (self electro optic device) technology [32]. They exploit electrically induced shifts of the absorption peak of excitons (electron-whole pseudoatoms) confined in narrow potential wells by thin layers of different semiconductors. Theoretically switching times of several picosecond can be achieved in MQW SEEDs modulators.

#### Light Detectors

A photodetector can be characterized by two factors: the dependence of the photocurrent on the absorbed optical energy (efficiency) and the statistical "noise" current. For reliable signal transmission photocurrents an order of magnitude greater than the noise current must be generated leading to a minimum reliably detectable energy. Silicon avalanche photodiodes are cited with bit rates of 1GHz at about 100 nW optical power or 100 MHz at less than 10 nW.

### 6 Parallel Word Level Write Consistency

It is important that a CRCW SM allows parallel write on word level as well as on bit level. However a majority decision on bit level can lead to inconsistent results on word level. In case of several processors trying to write a different value w into the same memory word the result of the operation does not correspond to any of the original words. To avoid this difficulty we propose an iterative backoff strategy on memory module level. It relies on the ability of a memory word to recognize a conflict, signal it, and use the bₜ bits of w to make all but \(\frac{P}{bₜ}\) processors back off in every iteration.

#### 6.1 Backoff Mechanism

**Memory Extensions** To signal a parallel write conflict an additional control bit (conflict bit, CB) is added to each memory word. In addition the SDs and CDS of the bits of w must be wired in such a way, that

1. a simultaneous signal occurring at the SD and CD of any bit (=an attempt to write conflicting values) can be detected and signalled by setting the CB and
2. there is a mode of operation, activated by a set CB, which unambiguously selects and sets a single bit out of the set off all bits of w with an active SD.

**WU Extensions** Every processor is assigned a unique number \(N_{PE}\). This number is coded to a basis of \(bₜ\). For \(bₜ = 8\) octal representation would be used. Each digit of \(N_{PE}\) is represented by \(bₜ\) bits of which only the one corresponding to the value of this digit is set. The octal, two digit number 13 would thus be represented by the digits \(d₁ = 01000000, d₂ = 00010000\).
Backoff Protocol A write operation by a subset of $P_{w,r} \leq P$ processors $\mathcal{P}$ on a memory word $\mathbf{w}$ is performed as described below:

1. The iteration number $k$ is set to 0.
2. Each processor in $\mathcal{P}$ writes its value of $\mathbf{w}$ to $\mathbf{w}$.
3. If a conflict occurs then:
   (a) the CB bit of $\mathbf{w}$ is set,
   (b) all other bits of $\mathbf{w}$ are cleared,
4. Each processor in $\mathcal{P}$ reads $\mathbf{w}$ to see if CB is set.
5. If CB is set then
   (a) each processor in $\mathcal{P}$ writes the $k$th digit $d_k$ of its $N_{PE}$ (in the $b_w$ bit representation described above) to $\mathbf{w}$,
   (b) of all the $d_k$s written by the processors in $\mathcal{P}$ one is selected by the priority selection mechanism and the corresponding bit is set, while all other bits of $\mathbf{w}$ remain off,
   (c) each processor in $\mathcal{P}$ reads $\mathbf{w}$, compares it to its $d_k$ and checks if CB is set,
   (d) all processors with $d_k \neq w$ leave $\mathcal{P}$.
   (e) If CB is set then $k$ is increased and all processors in $\mathcal{P}$ go back to step 5a.
6. The last processor remaining in $\mathcal{P}$ writes $\mathbf{w}$ to $\mathbf{w}$.

6.2 Cost

The backoff strategy makes sure that the number of processors $P_k$ remaining in the race after iteration $k$ is at most

$$P_k = \frac{P_{k-1}}{b_w} = \frac{P}{(b_w)^k}.$$ 

Thus the maximum number of iterations needed to solve a write conflict is $\log_{b_w}(P)$. For $b_w = 32$ this gives a maximum 3 iteration steps for up to 32768 processor and 4 iteration steps for up to 1048576 processors. Since the operation involves a single module rather then the whole memory the actual impact of a parallel write conflict on the write access speed is even less then the number of backoff iterations.

7 Evaluation

In this section we discuss the limits on the performance of the CRCW-SM described in the previous sections. Our purpose is not to provide a strict, exact analysis which is beyond the scope of this paper. Instead we want to show that a system could be feasible in the near future.

7.1 Performance Limits

The performance of the memory is given by the memory density $\psi^M$ determined by the minimal allowed bit radius $r_{bit}$, the degree of concurrency $P$, the maximal feasible capacity of the memory $M$ and the access latency $L$. The above are determined by the optical limits on components density as well as the performance and fabrication density of the optoelectronic components.

**Optical Memory Module System** The fundamental, implementation independent limits on the memory density and concurrency are given by equations (2) (3). The degree of space variance of the system is $max(P, p)$ the maximum of the number of processors and pages $max(P, p)$. This leads to:

$$max(P, p) = \frac{D^{MP}}{\lambda}. \tag{28}$$

with $D^{MP}$ being the diameter of the memory plane. The imaging of memory pages on the detectors is space invariant. Thus the maximum number of bits per page is given by

$$b = \left( \frac{d}{\lambda} \right)^2. \tag{29}$$

In the setup proposed in section 5 the maximum memory density $\psi^M$ determined by the minimal radius of a single bit $r_{bit}$ can be derived by applying equation 1 to the imaging lenses:

$$r_{bit} = 1.22\lambda f^I. \tag{30}$$

The above equation suggests no direct dependence of the the memory density on the number of accessing processors. However one has to keep in mind that for every accessing processor 2 additional imaging lenses must be added. In a three dimensional system this leads to

$$r_{bit} = 1.22\lambda f^I \sqrt{P} \tag{31}$$

Thus for a fixed size of the memory plane and constant $f^I$ the memory density linearly decreases with the number of accessing processors

$$\psi^M = \frac{1}{r_{bit}} \sim \frac{1}{P}. \tag{32}$$

To keep the memory density constant either the memory plane size must be kept proportional to $\sqrt{P}$ (meaning that $M$ must increase linearly with the number of processors) or the focal length $f^I$ must be decreased.

$$\psi^M = O(1), \quad \frac{f^I}{D^{MP}} \sim \frac{1}{\sqrt{P}} \tag{33}$$
The problem with this approach is that it causes the maximum angle of the light beams in the system $\theta_{ij}^{\text{max}}$ to grow as a function of the number of processors:

$$\tan \theta_{ij}^{\text{max}} = \frac{2 f_i}{D MP} \sim \frac{1}{\sqrt{P}}$$

**Memory Module Interconnection Network** The optical system of the memory module interconnection network is directly derived from the memory module system. Since $D$ is much smaller then $p$ a multiplexer module with a size not exceeding that of a memory module can easily be implemented. In case of a 3D quad tree a multiplexer module for 400 processors would require $p/D = P/D = 100$ smaller connection density then a memory module.

**Opto Electronic Components Constraints** All the electronic components discussed in section 5.3 have been fabricated in arrays with densities of more then $16^6$ elements per $mm^2$ on semiconductor wafers. They can be operated at frequencies of up to 10 GHz. Current MQW devices (modulators and VCSEL LDs) operate in the at the wavelength $\lambda \geq 850nm$. Achieving other wavelengths requires combining different semiconductors which poses some technological problems. However recently first VCSEL emitting in the ultraviolet with $\lambda \sim 200nm$ were reported.

**Overall Performance** The fundamental limit on the density of an OCRCW-SM arises from the limitation of the optical system. Assuming a wavelength of $1\mu m$, $2 f_i/D MP = 0.1$, lenses with $f_i/d_i = 2$ and a two detector one modulator bit structure about 100KBit could in theory be accommodated on a $cm^2$. According to (33) about 400 processors could access such a memory module concurrently. Using the 3D H-tree of multiplexers network some 400 such modules could be accommodated in a cube with $10cm$ side length. Reducing the wavelength to $500nm$ (green light) would result in around 400 KBit memory module capacity or 1600 accessing processors If it were possible to use UV-light the optical system would allow up to 1GBit and thousands of processors. On module level the latency and bit rate are limited by the performance of the optoelectronic components rather then the fundamental limit on the space and time bandwidth product. While energy and heat consideration might make GHz systems extremely difficult to implement, in theory module level latencies of $100ps$ are conceivable. Such high device speeds would allow to use time multiplexing to increase the capacity by a factor of $10^5$ or more while retaining acceptable module access latency. The overall latency must take into account the optical path length through the multiplexer network. For a $10^3cm$ cubicle the path to the modules is approx. 20cm leading to a delay of about 1ns. Including the delay caused by the multiplexing stages (also assumed to work at up the maximal rate of some 10GHz) gives an overall latency of less then 2ns

$$L \leq 2ns$$

7.2 Feasibility

Looking at the feasibility of the OCRCW-SM proposed in this paper we have to consider two aspects: the difference between fundamental and practical performance limits and the fact that opto-electronics is an emerging technology that in many areas is still far from achieving its full potential.

**Fundamental vs. Technical Limits** Important practical aspects that were left out of the fundamental limits analysis are crosstalk due to imperfections of the optical systems, power considerations and heat problems. The impact of the above factors depends on the exact technical implementation of the system and is difficult to judge. In [12] the authors estimate that taking crosstalk into account reduce the maximum interconnection density by a factor of about 10.

**Limits of Current Technology** The components necessary for the implementation of the OCRCW-SM described in this paper have all been demonstrated by different groups in laboratory experiments [22, 26]. Some are even commercially available [35]. They have been combined to form complex systems on an optical bench in various experiments [8, 38]. The integration of similar (less complex) systems in compact modules has also been reported [9, 21]. However currently, the complexity of such systems is still below the complexity of an OCRCW-SM large enough for practical purposes. Nevertheless the required technology is emerging with a large community of scientist working on demonstrating larger and more complex system.

7.3 Opto-Electronic vs. Electronic Shared Memory

As discussed in 2.1 the need for physically distinct connections and the 2D nature of electronic VLSI circuits lead to a $O(P^2)$ memory density reduction for concurrently accessible electronic memory devices. For the OCRCW-SM proposed in this paper the cost of concurrent access by $P$ processors is $O(P)$ for fixed and $O(1)$ for $P$ dependent maximal illumination angle. Thus as far as concurrency is concerned optical
memory proposed in this paper has a considerable fundamental advantage over electronic memory. While it is now widely acknowledge that electronic concurrent memory is not suitable more then a few processors the above suggests that OCRCW-SM might well be useful for large scale parallel systems. Particularly interesting is the possibility of using small amounts of OCRCW-SM together with conventional memory as parallel 'caches'.

8 Conclusion and Future Work

We have presented a design for a scalable OCRCW SM that overcomes the optical system complexity problem through modularization and multistage addressing. An optical system based on LD arrays and lenses was proposed to implement this architecture. A synchronization scheme was discussed that realizes word consistent concurrent write operation with negligible overhead even for large number of processors. A study of the fundamental performance constraints has shown that for the above memory system the cost of concurrent access by $P$ processors is $O(P)$ for fixed and $O(1)$ for $P$ dependent maximal illumination angle as opposed to the $O(P^2)$ cost in electronic multiport memory. At the same time the capacity and latency of the memory can in theory compete with conventional electronic memory.

We conclude that the OCRCW SM is an interesting alternative to the simulation of shared memory. However a lot of research must yet conducted before even a small practical device can be build. Especially we must keep in mind that an implementation of a large system must deal with aberrations, crosstalk, stability and power consumption aspects that were not considered in this paper. Problems that have to be tackled in the future include:

1. a detailed study of possible, technically feasible implementations of the optical system
2. implementation of proof of principle lab models
3. a study of the practical and technological performance limits take take into account optical system aberrations as well as power and heat dissipation
4. a study of possible computer architectures using limited amounts of OCRCW-SM in combination with conventional memory
5. simulations comparing the performance of machines using OCRCW-SM varying in size and performance to more conventional parallel machines.

References


