# Low Cost Network on Chip Router Design for Torus Topology

Bouraoui Chemli $^{\dagger}$  and Abdelkrim Zitouni  $^{\dagger\dagger}$ 

<sup>†</sup>Electronics and Microelectronics Laboratory, Faculty of Sciences of Monastir, University of Monastir, Tunisia <sup>††</sup>College of Education in Jubail, University of Dammam, KSA

#### Summary

Network on chip (NoC) has emerged as a good solution for enhancing the communication structures of complex Systems on Chip (SoC). Unlike bus-based systems, a NoC can integrate hundreds or thousands of intellectual properties (IPs), such as processors, memories or other custom designs, on a single chip. This work provides a comparison and performance analysis of a NoC router. The proposal supports the torus topology and implements the negative-first routing algorithm to avoid deadlocks. We describe the router architecture, which is composed of the input module, the switch allocator and the crossbar. Results are presented and compared with other works in terms of maximal clock frequency, area, power consumption and peak performance.

#### Key words:

Architecture; Turn model; NoC; Router; topology.

## 1. Introduction

NoC has recently been the subject of many research efforts. It relies on routing calculation and packet switching to decrease hardware cost and power consumption and to increase scalability and system performance [1]. A NoC can handle the communication of hundreds of cores and allows several transactions to proceed concurrently [2]. Hence, NoC is a strong candidate for future on-chip communication: it offers more flexibility, higher scalability and lower latency than conventional bus technology [3]. A NoC architecture mainly consists of three parts: the router, which forwards packets across the network; the network interface, which gives cores access to the network; and the links, which interconnect the NoC parts [4]. The way these parts are connected affects the data transmission capability. Thus, the NoC topology not only plays an important role in determining network latency, throughput and area, but also conditions the scalability and reusability of the NoC design. In this paper, we present a scalable and flexible router architecture for a torus topology NoC. We then compare and analyze the performance of the proposal against routers designed for other common topologies.

This paper is organized as follows. Section 2 deals with related work. Section 3 presents the topology of the NoC. Section 4 describes the routing algorithm. Section 5 presents the pipeline stages of the proposed router in detail. Section 6 provides the performance results, and Section 7 concludes the paper.

## 2. Related work

Much research has been devoted to NoC architecture. This section introduces and discusses some of the NoC designs proposed in the scientific literature.

In [5], the authors presented a NoC architecture that uses the mesh topology, implements wormhole switching and adopts stall-and-go flow control. To decrease latency and boost throughput, they implemented a Short-Pass-Link customization. However, both the hardware cost and the power consumption of their design still need to be reduced. In [6], the authors suggested a scalable packet-based router architecture that dynamically transfers and manages many transactions concurrently, and they used their router in mesh and torus NoCs. Their design suffers, however, from low throughput and high power consumption. In [7], the authors presented a new NoC architecture based on the mesh topology, the XY routing algorithm and credit-based flow control. To support and differentiate between packet QoS classes, they implemented a dynamic arbiter. Although their architecture reduced the average network latency, they did not study deadlock situations, which are critical for NoC design. In [8], the authors offered a flexible router design for mesh NoC. The ASIC synthesis of their design showed promising results in terms of power consumption and area. Nonetheless, handshaking flow control can increase calculation time, and a credit-based flow control seems preferable. In [9], a Mesh/Torus NoC design was introduced that uses input queuing, the XY routing algorithm and virtual channels. Although the design is deadlock free and implements a simple routing mechanism, it suffers from area overhead and a lack of scalability. In [10], we proposed a router architecture for the mesh NoC topology. We applied the deadlock-free negative-first algorithm, which is a turn-model based routing algorithm. The router also uses a dynamic arbiter to provide QoS and serve packets fairly. Still, the router pipeline stages suffer from dependencies that increase latency and hardware cost.

Manuscript received May 5, 2017 Manuscript revised May 20, 2017

Consequently, we propose a router architecture for the torus NoC topology. The proposal has been prototyped on Virtex5 and Virtex6 FPGAs. We provide a performance comparison with routers for three common topologies: diagonal mesh, 2D mesh and 3D mesh.

## 3. Topology of the proposed NoC

As illustrated in Figure 1, the proposed NoC uses a regular topology, a 4 by 4 torus, with the well-known wormhole switching as the switching technique and handshaking as the flow control. Each router is assigned a unique address defined in XY coordinates. Routers are connected to each other via ports, according to their position in the network. Each router can have a maximum of five bidirectional ports: four are connected to the adjacent routers in each direction (north, east, south and west), and one is connected to the local IP. To reduce both the power consumption and the area of the chip, any port with no connection to another router is deactivated.



Fig. 1 4x4 NoC hierarchical mesh topology.
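To make the wraparound connectivity of the torus concrete, the following Python sketch computes the four neighbors of a router in a 4 by 4 network. The row-major address `size * y + x`, the function name `torus_neighbors` and the convention that north increases `y` are assumptions made for illustration; they were chosen to agree with the offsets of 4 used for the north and south directions in Section 4, but the text itself only states that each router has a unique XY address.

```python
SIZE = 4  # 4 by 4 torus, as in the proposed NoC


def torus_neighbors(router_id, size=SIZE):
    """Return the neighbor addresses of a router in a size x size torus.

    Assumes row-major addressing (router_id = size * y + x), which is an
    illustrative choice and not stated explicitly in the paper.
    """
    x, y = router_id % size, router_id // size
    return {
        "north": size * ((y + 1) % size) + x,  # top row wraps to bottom row
        "south": size * ((y - 1) % size) + x,  # bottom row wraps to top row
        "east": size * y + (x + 1) % size,     # last column wraps to first
        "west": size * y + (x - 1) % size,     # first column wraps to last
    }


if __name__ == "__main__":
    for rid in (0, 5, 15):
        print(rid, torus_neighbors(rid))
```

Under these assumptions a corner router such as router 0 still has four neighbors (4, 12, 1 and 3), because the wraparound links connect opposite edges of the network.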

## 4. Routing algorithm

We adopted the negative-first routing algorithm, which is simple to implement, for the proposed design. As a turn-model algorithm, it avoids deadlock and livelock without the virtual channels that would add hardware complexity. Each input port has its own routing block where the routing calculation is performed. This block compares the current router address with the destination address of the flit to determine the destination port:

| Symbol | Meaning |
|--------|---------|
| @dest | Destination address of the flit |
| R_ID | Current router address |
| Py | North direction |
| Px | East direction |
| L | Local port |
| Ny | South direction |
| Nx | West direction |

| Routing decision |
|------------------|
| `if (@dest >= R_ID + 4 or @dest = R_ID + 11 or @dest <= R_ID - 12) then` |
| `direction <= Py ;` |
| `elsif (@dest = R_ID + 2 or @dest = R_ID + 2 or @dest = R_ID - 3) then` |
| `direction <= Px ;` |
| `elsif (@dest = R_ID) then` |
| `direction <= L ;` |
| `elsif (@dest <= R_ID - 4 and @dest >= R_ID - 12) then` |
| `direction <= Ny ;` |
| `elsif (@dest = R_ID - 1 or @dest = R_ID - 2) then` |
| `direction <= Nx ;` |
| `end if;` |

## 5. Router design

### 5.1 Router pipeline stages

The router is the backbone of the NoC design and defines its communication architecture. As presented in Figure 2, the proposed router architecture is primarily composed of three pipeline stages: routing calculation, switch allocation and crossbar traversal. First, packets are received at the input ports from neighboring routers or from the locally connected core. Then, the routing calculation and the arbitration process are performed at the same time in order to reduce latency. Once both processes have finished, the destination port is communicated to the crossbar, which establishes the connection that allows the packet to reach its target.



Fig. 2 Router pipeline stages.

### 5.2 Switching technique

The proposed architecture applies the well-known wormhole switching technique. As displayed in Figure 3, a packet is composed of a header flit followed by body flits. Each flit is 32 bits wide. The first four bits of the header flit are dedicated to the destination address and the fifth bit to the QoS type. The next four bits give the number of flits per packet, the following three bits are dedicated to the packet priority, and the rest of the flit is an extension. The body flit carries a 32-bit payload. The flit size can be changed according to the application requirements.



Fig. 3 Packet format.
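The header layout described above can be sketched as a pair of pack and unpack helpers. This is only an illustration: the field names, the helper names `pack_header` and `unpack_header`, and the reading of "first bits" as the most significant bits are assumptions, since the text gives the field widths but not the bit ordering.

```python
FLIT_BITS = 32

# (field name, width in bits), listed from the first bit of the header flit.
# Treating "first" as most significant is an assumption of this sketch.
HEADER_FIELDS = (
    ("dest", 4),        # destination address
    ("qos", 1),         # QoS type
    ("num_flits", 4),   # number of flits per packet
    ("priority", 3),    # packet priority
    ("extension", 20),  # remaining bits of the 32-bit flit
)


def pack_header(**values):
    """Pack field values into a 32-bit header flit (missing fields are 0)."""
    word = 0
    for name, width in HEADER_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_header(word):
    """Split a 32-bit header flit back into its fields."""
    fields = {}
    shift = FLIT_BITS
    for name, width in HEADER_FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    header = pack_header(dest=0b1010, qos=1, num_flits=5, priority=3)
    print(hex(header), unpack_header(header))
```

Keeping the layout in a single table such as `HEADER_FIELDS` mirrors the remark that the flit size can be adapted to the application: changing a width only requires editing one entry.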

### 5.3 Switch allocation

While the routing calculation is performed, the switch allocator receives signals from the input ports indicating the priority of each packet and the requested output port. The switch allocator is composed of a priority scheduler block and a round-robin arbiter. This scheme serves the highest-priority packet at the selected output port while arbitrating fairly among competing requests.
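One way to combine a priority scheduler with a round-robin arbiter is sketched below for a single output port. It is a behavioral illustration under stated assumptions, not the authors' VHDL: the class name `OutputArbiter`, the encoding of requests as a dictionary, and the rule that a larger number means a higher priority are all hypothetical.

```python
NUM_PORTS = 5  # north, east, south, west and local


class OutputArbiter:
    """Arbiter for one output port: highest priority wins, ties go round-robin."""

    def __init__(self, num_ports=NUM_PORTS):
        self.num_ports = num_ports
        self.last_granted = num_ports - 1  # so that port 0 is examined first

    def grant(self, requests):
        """Pick the input port to serve.

        requests maps an input port index to its packet priority (assumed:
        larger value means more urgent). Returns the granted input port, or
        None when no input requests this output.
        """
        if not requests:
            return None
        top = max(requests.values())
        # Scan ports in round-robin order, starting just after the last winner.
        for offset in range(1, self.num_ports + 1):
            port = (self.last_granted + offset) % self.num_ports
            if requests.get(port) == top:
                self.last_granted = port
                return port
        return None


if __name__ == "__main__":
    arbiter = OutputArbiter()
    print(arbiter.grant({0: 1, 2: 3, 4: 3}))
    print(arbiter.grant({2: 3, 4: 3}))
```

In this sketch the priority comparison decides which requests are eligible, and the round-robin pointer only breaks ties among equally urgent packets, so no input port of a given priority level is starved by another port of the same level.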

### 5.4 Crossbar

As shown in Figure 4, the crossbar is composed of a maximum of six multiplexers. It waits for the switch allocator to notify it of the chosen output port and, based on this notification, forwards flits to the appropriate output port. The number of flits per packet tells the crossbar when the transmission ends, after which the channel can be used by other packets.



Fig. 4 Crossbar circuit
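The crossbar behavior described above, where a connection is held until the announced number of flits has passed, can be modeled as follows. This is a simplified software sketch: the class `Crossbar`, its method names and the one-flit-per-call timing are assumptions, and the number of multiplexers is left as a parameter rather than fixed.

```python
class Crossbar:
    """One multiplexer per output; select lines are set by the switch allocator."""

    def __init__(self, num_ports):
        self.select = [None] * num_ports  # input currently driving each output
        self.remaining = [0] * num_ports  # flits left before release

    def connect(self, out_port, in_port, num_flits):
        """Record the allocator's decision for a packet of num_flits flits."""
        if num_flits < 1:
            raise ValueError("a packet needs at least one flit")
        if self.select[out_port] is not None:
            raise RuntimeError(f"output {out_port} is busy")
        self.select[out_port] = in_port
        self.remaining[out_port] = num_flits

    def traverse(self, inputs):
        """Forward one flit per connected output; inputs[i] is the flit at input i."""
        outputs = [None] * len(self.select)
        for out_port, in_port in enumerate(self.select):
            if in_port is None:
                continue
            outputs[out_port] = inputs[in_port]
            self.remaining[out_port] -= 1
            if self.remaining[out_port] == 0:  # last flit: free the channel
                self.select[out_port] = None
        return outputs


if __name__ == "__main__":
    xbar = Crossbar(num_ports=5)
    xbar.connect(out_port=1, in_port=0, num_flits=2)
    print(xbar.traverse(["header", None, None, None, None]))
    print(xbar.traverse(["body", None, None, None, None]))
```

Counting down the flits at the output is one simple way to release the channel without inspecting the flit contents, which matches the role given to the flit count in the text.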

## 6. Experimental results

This section gives the synthesis results, the performance analysis and the evaluation of the router implementation. The proposed router was designed in VHDL at the register transfer level (RTL). It was simulated and synthesized using the Xilinx ISE 13.1 tool, and then implemented on two different FPGAs, Virtex5 and Virtex6, as shown in Table 1.

The implementation and evaluation results targeted both FPGA and ASIC technologies and are provided in terms of maximal clock frequency, area, power consumption and the estimated peak performance. The operating frequency of the router allows us to calculate the estimated peak performance (PP). The PP depends on the maximal operating frequency (Fmax) of the router, the clock cycle time (T) for the transmission of one flit and the flit size:

$$PP_{\text{per port}} = \frac{F_{max}}{T} \times \text{flit size}$$
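The formula above can be evaluated directly, as in the short sketch below. The function name `peak_performance_per_port` and the choice T = 1 (one clock cycle per flit) are assumptions; the 262 MHz frequency and the 32-bit flit size are taken from the paper. The text does not state the value of T behind the entries of Table 1, so the figure produced here under T = 1 need not coincide with the tabulated one.

```python
def peak_performance_per_port(f_max_hz, cycles_per_flit, flit_size_bits):
    """Estimated peak performance per port, in bits per second.

    Implements PP = (Fmax / T) * flit size, with T expressed as the number
    of clock cycles needed to transmit one flit.
    """
    if cycles_per_flit <= 0:
        raise ValueError("cycles_per_flit must be positive")
    return f_max_hz / cycles_per_flit * flit_size_bits


if __name__ == "__main__":
    # 262 MHz and 32-bit flits come from the paper; T = 1 is an assumption.
    pp = peak_performance_per_port(262e6, 1, 32)
    print(f"{pp / 1e9:.3f} Gbit/s")
```

The same function shows how the estimate scales: doubling the number of cycles per flit halves the peak performance, while widening the flit increases it proportionally.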

Table 1: Results of the proposed router on the same FPGA families as the other works

| FPGA | Virtex5 | Virtex6 |
|------|---------|---------|
| Design | 2D router | 2D router |
| Topology | Torus | Torus |
| Number of ports | 5 | 5 |
| Routing algorithm | Negative-first | Negative-first |
| Frequency (MHz) | 262 | 270 |
| Area (Slice) | 459 | 446 |
| Power estimation (mW) | 20 | 18 |
| Estimated peak performance per port (Gbits/s) | 83.84 | 86.40 |

The aim of this research is to deliver a comparative study of different NoC topologies and to help designers choose their NoC architecture carefully. In [11], the authors describe a flexible and extensible router implementation on Virtex2 and Virtex5 FPGAs. It supports the diagonal mesh topology, adopts a deterministic routing algorithm, and uses packet switching and a dynamic arbiter. In [12], the authors present a 2D mesh topology NoC based on a virtual router. Their architecture comes in two versions: one for reducing the resource cost and the other for reducing the latency of the network. In [13], we describe a router design for a 3D NoC topology, in which we used the turn-model negative-first routing algorithm to avoid deadlocks. In [14], the authors describe a buffer-less router architecture for 3D NoC. Their architecture uses minimal routing and has been implemented on a Virtex-6 FPGA. Table 2 compares the results of the proposal with these works in terms of area, maximal clock frequency and estimated peak performance. The proposal occupies less area than all of the designs in [11-14]. It reaches a higher maximal clock frequency than the designs in [11-13], but a lower one than the design in [14]. It also achieves a higher estimated peak performance than the designs in [11, 13, 14].

Table 2: State of the art of routers in FPGA

| Design | [11] | [12] | [13] | [14] |
|--------|------|------|------|------|
| Topology | Diagonal mesh | 2D mesh | 3D mesh | 3D mesh |
| Number of ports | - | 5 | 7 | 7 |
| Routing algorithm | Deterministic routing | XY routing | Negative-first routing | Minimal routing |
| Frequency (MHz) | 200 | 23 | 195 | 353 |
| Area (Slice) | 989 | 25821 | 7847 | 788 |
| Power estimation (mW) | 33 | - | - | - |
| Estimated peak performance per port (Gbits/s) | 8.44 | - | 62.4 | 16 |
| FPGA device | Virtex5 | Virtex6 | Virtex6 | Virtex6 |
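The area comparison between Table 1 and Table 2 can be checked with a few lines of Python. The slice counts are copied from the two tables; pairing each cited design with the proposal's result on the same FPGA family, and truncating rather than rounding to two decimals, are assumptions of this sketch. Under these assumptions it yields the factors 2.15, 57.89, 17.59 and 1.76 quoted in the conclusion.

```python
import math

# Slice counts taken from Table 1 (proposal) and Table 2 (other works).
PROPOSAL_SLICES = {"Virtex5": 459, "Virtex6": 446}
OTHER_WORKS = {  # reference: (slices, FPGA family)
    "[11]": (989, "Virtex5"),
    "[12]": (25821, "Virtex6"),
    "[13]": (7847, "Virtex6"),
    "[14]": (788, "Virtex6"),
}


def area_ratios():
    """Ratio of each cited router's slice count to the proposal's.

    Each design is compared with the proposal on the same FPGA family,
    which is an assumption about how the comparison was made.
    """
    return {
        ref: slices / PROPOSAL_SLICES[family]
        for ref, (slices, family) in OTHER_WORKS.items()
    }


if __name__ == "__main__":
    for ref, ratio in area_ratios().items():
        # Truncate (rather than round) to two decimals.
        print(ref, math.floor(ratio * 100) / 100)
```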

## 7. Conclusion

We have proposed a router architecture for the torus NoC topology. The router pipeline stages, namely routing calculation, switch allocation and crossbar traversal, are presented in detail. To evaluate the performance of the proposal, we compared it with other works in terms of maximal clock frequency, area, estimated power consumption and estimated peak performance. The evaluation results show that the proposal reaches a higher clock frequency than three of the other works. In terms of hardware cost, the proposal is 2.15, 57.89, 17.59 and 1.76 times smaller than the other routers, which address the diagonal mesh, 2D mesh and 3D mesh topologies.

#### References

- [1] Salah, Y., & Tourki, R. (2011, December). Design and fpga implementation of a qos router for networks-on-chip. In 2011 3rd International Conference on Next Generation Networks and Services (NGNS) (pp. 84-89). IEEE.
- [2] Attia, B., Chouchene, W., Zitouni, A., Abid, N., & Tourki, R. (2011, March). A modular router architecture design for Network on Chip. In Systems, Signals and Devices (SSD), 2011 8th International Multi-Conference on (pp. 1-6). IEEE.

- [3] Elhaji, M., Boulet, P., Zitouni, A., Meftali, S., Dekeyser, J. L., & Tourki, R. (2012). System level modeling methodology of NoC design from UML-MARTE to VHDL. Design Automation for Embedded Systems, 16(4), 161-187.
- [4] Chemli, B., & Zitouni, A. (2016, November). Design and Evaluation of Optimized router pipeline stages for Network on Chip. In Image Processing, Applications and Systems Conference (IPAS), 2016 Second International. IEEE.
- [5] Ahmed, A. B., & Abdallah, A. B. (2012, August). ONoC-SPL Customized Network-on-Chip (NoC) Architecture and Prototyping for Data-intensive Computation Applications. In Proceedings of the 4th International Conference on Awareness Science and Technology, Seoul, Korea (Vol. 2124, pp. 257-262).
- [6] Salah, Y., Atri, M., & Tourki, R. (2007, December). Design of a 2d mesh-torus router for network on chip. In 2007 IEEE International Symposium on Signal Processing and Information Technology (pp. 626-631). IEEE.
- [7] Wissem, C., Attia, B., Noureddine, A., Zitouni, A., & Tourki, R. (2011, December). A quality of service network on chip based on a new priority arbitration mechanism. In ICM 2011 Proceeding (pp. 1-6). IEEE.
- [8] Asghari, S. A., Pedram, H., Khademi, M., & Yaghini, P. (2009). Designing and Implementation of a Network on Chip Router Based on Handshaking Communication Mechanism. World Applied Sciences Journal, 6(1), 88-93.
- [9] Salah, Y., Kaddachi, M. L., & Tourki, R. (2013). FPGA Hardware Implementation and Evaluation of a Micro-Network Architecture for Multi-Core Systems. World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, 7(1), 53-59.
- [10] Chemli, B., & Zitouni, A. (2015, December). Design of a Network on Chip router based on turn model. In Sciences and Techniques of Automatic Control and Computer Engineering (STA), 2015 16th International Conference on (pp. 85-88). IEEE.
- [11] Elhajji, M., Attia, B., Zitouni, A., Tourki, R., Meftali, S., & Dekeyser, J. L. (2011, November). FERONOC: flexible and extensible router implementation for diagonal mesh topology. In Design and Architectures for Signal and Image Processing (DASIP), 2011 Conference on (pp. 1-8). IEEE.
- [12] Chatmen, M. F., Baganne, A., & Tourki, R. (2016). New design of Network on Chip Based on Virtual Routers. Indonesian Journal of Electrical Engineering and Computer Science, 2(1), 115-131.
- [13] Chemli, B., & Zitouni, A. (2014). A Turn Model Based Router Design for 3D Network on Chip. World Applied Sciences Journal, 32(8), 1499-1505.
- [14] Yu, X., Li, L., Zhang, Y., Pan, H., & He, S. (2013, June). Mass message transmission aware buffer-less packet-circuit switching router for 3D Noc. In Control and Automation (ICCA), 2013 10th IEEE International Conference on (pp. 983-986). IEEE.



Bouraoui Chemli received his master's and master of research degrees in electronics and microelectronics from the Faculty of Sciences of Monastir, Tunisia, in 2010 and 2012 respectively. He is currently a Ph.D. student at the Electronics and Microelectronics Laboratory of the Faculty of Sciences of Monastir, Tunisia. His current research interests in network on



Abdelkrim Zitouni was born in Gabès, Tunisia, on 6 October 1970. He received the DEA and the PhD degree in Physics (Electronics option) from the Faculty of Sciences of Monastir, Tunisia, in 1996 and 2001 respectively. He received the HDR degree in Physics (Electronics option) in 2009. Since then he has been a professor in Electronics and Microelectronics with the Physics Department of the Faculty of Sciences of Monastir. His research interests include communication synthesis for SoC, video coding and asynchronous system design.