# FPGA Implementation of Efficient Fast Convolution Architecture Based Discrete Wavelet Transform

P. Arulselvan<sup>1</sup>, C. Karthik<sup>2</sup>, M. Peer Mohamed<sup>3</sup>, PG Scholars, Government College of Technology, Coimbatore-641 013

#### Abstract

This paper presents a VLSI design approach for an efficient, high-speed 1D discrete wavelet transform (DWT) that reduces the hardware complexity and, in addition, reduces the critical path to the multiplier delay. The hardware requirement is a major concern in the processing of the discrete wavelet transform. The system is verified using (9,7) filter coefficients on a Xilinx Spartan-3E field-programmable gate array (FPGA) device without accessing any external memory. It is observed that the approximation method for constant-multiplier implementation in the DWT increases the speed and reduces the hardware requirement of the computation. The developed design therefore has a reduced combinational path delay and computation power, and provides very high-speed processing.

#### Keywords- DWT, Fast Convolution, VLSI, FPGA

# 1. Introduction

The discrete wavelet transform (DWT) has gained wide popularity due to its excellent de-correlation property. Many modern image and video compression systems embody the DWT as the transform stage. It is widely recognized that the (9,7) filters are among the best filters for DWT-based image compression [1]. In fact, the JPEG2000 image coding standard employs the (9,7) filters as the default wavelet filters for lossy compression.

Lifting and convolution are the two computing approaches used to realize the discrete wavelet transform. While conventional lifting-based architectures require fewer arithmetic operations than the convolution-based approach, they sometimes have long critical paths. If $T_a$ and $T_m$ are the delays of the adder and multiplier, respectively, then the critical path of the lifting-based architecture for the (9,7) filter is $(4 \times T_m + 8 \times T_a)$, while that of the convolution implementation is $(T_m + 2 \times T_a)$ [2]. In addition, to preserve proper precision, the intermediate variables are wider in lifting-based computing. As a result, the lifting multiplier and adder delays are longer than the convolution ones. Hence convolution is the better method for reducing the delay in the computation of the DWT [5].
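As a quick numerical illustration of the two critical-path expressions above, the following sketch evaluates both for a pair of delays. The function names and the delay values are assumptions chosen for illustration only; they are not measurements from this work.

```python
def lifting_critical_path(t_m: float, t_a: float) -> float:
    """Critical path of the lifting-based (9,7) architecture: 4*Tm + 8*Ta."""
    return 4 * t_m + 8 * t_a


def convolution_critical_path(t_m: float, t_a: float) -> float:
    """Critical path of the convolution-based (9,7) architecture: Tm + 2*Ta."""
    return t_m + 2 * t_a


# Hypothetical multiplier and adder delays in nanoseconds (assumed values).
t_m, t_a = 4.0, 1.0

lifting = lifting_critical_path(t_m, t_a)
convolution = convolution_critical_path(t_m, t_a)
ratio = lifting / convolution

print(f"lifting: {lifting} ns, convolution: {convolution} ns, ratio: {ratio}")
```

Since $4T_m + 8T_a = 4(T_m + 2T_a)$, the ratio of the two expressions is 4 for any choice of delays, before accounting for the wider intermediate variables of the lifting scheme.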

Conventionally, programmable DSP chips are used to implement DWT algorithms for low-rate applications, and VLSI application-specific integrated circuits (ASICs) are used for higher rates. FPGAs are programmable logic devices that provide sufficient logic resources to support a large parallel distributed architecture.

# 2. Discrete Wavelet Transform Features

The discrete wavelet transform is a mathematical tool that has aroused great interest in the field of image processing due to its attractive features. Some of these characteristics are:

• It allows multiresolution representation of an image in a natural way, because additional wavelet subbands can be used to progressively enlarge the low-frequency subband.

• It supports analysis of the wavelet coefficients in both the space and frequency domains; the interpretation of the coefficients is therefore not constrained to their frequency behaviour, which allows better analysis for image vision and segmentation; and

• For natural images, the DWT achieves high compactness of energy in the lower frequency subbands, which is extremely useful in applications such as image compression.
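The energy-compaction property in the last item can be illustrated with a small sketch. It uses the two-tap Haar filter pair as a simple stand-in for the (9,7) filters and a smooth synthetic signal as a stand-in for a row of a natural image; both choices are assumptions made for brevity.

```python
import math


def haar_analysis(x):
    """Split x into scaled pairwise sums (low band) and differences (high band)."""
    s = 1 / math.sqrt(2)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high


def energy(v):
    """Sum of squared samples."""
    return sum(c * c for c in v)


# Smooth test signal: half a sine period over 64 samples (assumed example data).
x = [math.sin(math.pi * i / 63) for i in range(64)]

low, high = haar_analysis(x)
low_share = energy(low) / (energy(low) + energy(high))
print(f"share of energy in the low band: {low_share:.4f}")
```

For this smooth input almost all of the signal energy ends up in the low-frequency band, which is the behaviour that compression schemes exploit.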

# 2.1 1D Discrete Wavelet Transform

The input discrete signal $x(n)$ is filtered by a low-pass filter ($h$) and a high-pass filter ($g$) at each transform level. The two output streams are then sub-sampled by dropping the alternate output samples in each stream to produce the low-pass subband $(y_L)$ and the high-pass subband $(y_H)$. The associated equations can be written as (1). Figure 1 shows the signal analysis in the one-dimensional (1D) discrete wavelet transform.



$$\begin{split} y_L(n) &= \sum_{i} h(2n-i)\,x(i) \\ y_H(n) &= \sum_{i} g(2n-i)\,x(i) \end{split} \qquad (1)$$



Figure 1 One Dimensional (1D) Discrete Wavelet transform
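The filter-and-subsample step described above can be sketched in a few lines. The helper names and the two-tap filters below are assumptions used only to keep the example small; they are not the (9,7) filters of this work, and the signal borders are zero padded.

```python
def convolve(x, f):
    """Full linear convolution of x with filter f (zero-padded borders)."""
    out = [0.0] * (len(x) + len(f) - 1)
    for i, xi in enumerate(x):
        for k, fk in enumerate(f):
            out[i + k] += xi * fk
    return out


def analysis_1d(x, h, g):
    """One DWT level: filter with h and g, then keep every second output sample."""
    low = convolve(x, h)[::2]    # y_L(n) is the low-pass output at index 2n
    high = convolve(x, g)[::2]   # y_H(n) is the high-pass output at index 2n
    return low, high


# Toy two-tap filters and input (assumed values for illustration).
h = [0.5, 0.5]
g = [0.5, -0.5]
x = [4.0, 4.0, 6.0, 6.0, 2.0, 2.0]

y_l, y_h = analysis_1d(x, h, g)
print(y_l)  # [2.0, 5.0, 4.0, 1.0]
print(y_h)  # [2.0, 1.0, -2.0, -1.0]
```

Computing the full convolution and then discarding half of the samples is wasteful; a hardware implementation would evaluate only the retained outputs.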

# 2.2 2D Discrete Wavelet Transform

The basic idea of the 2D architecture is similar to that of the 1D architecture [3]. A 2D DWT (Figure 2) can be seen as a 1D wavelet transform along the rows followed by a 1D wavelet transform along the columns. The 2D DWT operates in a straightforward manner by inserting an array transposition between the two 1D DWTs.



Figure 2 Two Dimensional (2D) Discrete Wavelet transform
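The row transform, transposition, and column transform described above can be sketched as follows. The helper names, the two-tap filters, and the subband labels (`ll`, `lh`, `hl`, `hh`) are assumptions for illustration; labelling conventions for the mixed subbands vary between authors.

```python
def convolve(x, f):
    """Full linear convolution of x with filter f (zero-padded borders)."""
    out = [0.0] * (len(x) + len(f) - 1)
    for i, xi in enumerate(x):
        for k, fk in enumerate(f):
            out[i + k] += xi * fk
    return out


def analysis_rows(image, h, g):
    """Apply one 1D DWT level to every row of a 2D list."""
    lows = [convolve(row, h)[::2] for row in image]
    highs = [convolve(row, g)[::2] for row in image]
    return lows, highs


def transpose(m):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*m)]


def analysis_2d(image, h, g):
    """Rows first, then a transposition, then the same 1D transform on the columns."""
    low, high = analysis_rows(image, h, g)
    ll, lh = analysis_rows(transpose(low), h, g)
    hl, hh = analysis_rows(transpose(high), h, g)
    return transpose(ll), transpose(lh), transpose(hl), transpose(hh)


# Constant 4x4 test image and toy two-tap filters (assumed values).
image = [[1.0] * 4 for _ in range(4)]
ll, lh, hl, hh = analysis_2d(image, [0.5, 0.5], [0.5, -0.5])
print(ll[1][1], hh[1][1])
```

For a constant image the interior of the low-low subband keeps the image value and the interior detail coefficients vanish; the non-zero values at the borders come from the zero padding assumed in `convolve`.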

# 3. Existing 1D Discrete Wavelet Transform Architectures

# 3.1 Basic 1D-DWT Architecture for FIR Filter

A basic implementation of the 1D DWT uses the Daubechies biorthogonal wavelet coefficients. Two output bands are produced by applying two FIR filters to the input data samples. A low-pass filter using the h(x) coefficients produces the low-frequency data, and a high-pass filter using the g(x) coefficients produces the high-frequency data. As an example, Figure 3 shows the 9/7-tap Daubechies DWT, consisting of a 9-tap low-pass filter and a 7-tap high-pass filter.



Figure 3 Basic 1D-DWT Architecture by (9,7) Daubechies Filter
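For reference, the sketch below builds the two symmetric filters from their half-coefficients. The numerical values are the commonly quoted (9,7) analysis coefficients in a normalization where the low-pass taps sum to one; they are an assumption here, since this paper does not list its coefficients, and should be checked against the intended specification before use.

```python
# Commonly quoted (9,7) analysis half-filters (assumed values, not from this paper).
H_HALF = [0.6029490182363579, 0.2668641184428723, -0.07822326652898785,
          -0.01686411844287495, 0.02674875741080976]   # centre tap, then 4 mirrored taps
G_HALF = [1.115087052456994, -0.5912717631142470, -0.05754352622849957,
          0.09127176311424948]                          # centre tap, then 3 mirrored taps


def symmetric_taps(half):
    """Mirror a half-filter [c0, c1, ..., ck] into [ck, ..., c1, c0, c1, ..., ck]."""
    return half[:0:-1] + half


h = symmetric_taps(H_HALF)   # 9-tap low-pass filter
g = symmetric_taps(G_HALF)   # 7-tap high-pass filter

print(len(h), len(g))  # 9 7
print(f"low-pass DC gain: {sum(h):.6f}, high-pass DC gain: {sum(g):.6f}")
```

A unit DC gain for the low-pass filter and a zero DC gain for the high-pass filter are a useful sanity check on any coefficient table.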

# 3.2 Convolution Based 1D DWT Architecture

The convolution-based 1D DWT method reduces the architecture complexity by lowering the number of adders and multipliers relative to the basic (9,7) FIR filter architecture. The architecture provides the convolution outputs for the high-pass and low-pass subbands separately for high-speed processing [4].



Figure 4 Convolution based 1D DWT Architecture

# 4. Fast Convolution Based 1D Discrete Wavelet Transform

A digital filter generally comprises a plurality of multipliers, which occupy large areas, consume much power, and impose constraints on a one-chip solution when the circuits are integrated. Efforts have therefore been expended to reduce the associated hardware complexity by simplifying the multipliers in convolution architectures [3-7].
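One common way to simplify a constant multiplier is to quantize the coefficient to fixed point and replace the multiplication by shifts and additions. The sketch below illustrates that general idea only; it is an assumption for illustration and not necessarily the approximation used in this work. The coefficient 0.625 and the word length are hypothetical.

```python
def quantize(coeff: float, frac_bits: int) -> int:
    """Round a real coefficient to a fixed-point integer with frac_bits fractional bits."""
    return round(coeff * (1 << frac_bits))


def shift_add_multiply(x: int, q: int, frac_bits: int) -> int:
    """Multiply integer x by the fixed-point constant q using only shifts and adds."""
    negative = q < 0
    q = abs(q)
    acc = 0
    bit = 0
    while q:
        if q & 1:             # each set bit of the constant contributes one shifted copy of x
            acc += x << bit
        q >>= 1
        bit += 1
    acc >>= frac_bits         # drop the fractional bits (floor of the magnitude product)
    return -acc if negative else acc


# Hypothetical coefficient 0.625 = 0.101 in binary, i.e. two shifted additions.
q = quantize(0.625, 3)
print(q, shift_add_multiply(16, q, 3))  # 5 10
```

The number of adders needed for each constant equals the number of set bits in its quantized value minus one, which is why coefficient word length directly affects area and delay.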

In the proposed architecture, shown in Figure 5, registers are added to supply either the LPF or the HPF coefficients. This makes the LPF and HPF architecture suitable for a multiplier-less implementation, as the coefficients to be multiplied in the LPF and HPF are different [5].



Figure 5 The Convolution based implementation of 1D DWT

The proposed architecture, based on the fast convolution approach, provides a high-speed discrete wavelet transform implementation with reduced dynamic power and computational delay. The (9,7) filter is used with a reduced number of multipliers, and common multipliers produce both the high-pass and low-pass values in different clock cycles [8]. The (9,7) filter has 9 low-pass filter coefficients $h = \{h_{-4}, h_{-3}, h_{-2}, h_{-1}, h_0, h_1, h_2, h_3, h_4\}$ and 7 high-pass filter coefficients $g = \{g_{-2}, g_{-1}, g_{0}, g_{1}, g_{2}, g_{3}, g_{4}\}$, which are symmetric ($h_i = h_{-i}$). The symmetry of the low-pass filter coefficients gives the following outputs,

$$\begin{split} y_{L0} &= h_0(0+x_0) + h_1(0+x_1) + h_2(0+x_2) + h_3(0+x_3) + h_4(0+x_4) \\ y_{L1} &= h_0(0+x_2) + h_1(x_1+x_3) + h_2(x_0+x_4) + h_3(0+x_5) + h_4(0+x_6) \\ y_{L2} &= h_0(0+x_4) + h_1(x_3+x_5) + h_2(x_2+x_6) + h_3(x_1+x_7) + h_4(x_0+x_8) \\ &\;\;\vdots \\ y_{L\frac{N}{2}-2} &= h_0(x_{N-4}+0) + h_1(x_{N-5}+x_{N-3}) + h_2(x_{N-6}+x_{N-2}) + h_3(x_{N-7}+x_{N-1}) + h_4(x_{N-8}+0) \\ y_{L\frac{N}{2}-1} &= h_0(x_{N-2}+0) + h_1(x_{N-3}+x_{N-1}) + h_2(x_{N-4}+0) + h_3(x_{N-5}+0) + h_4(x_{N-6}+0) \end{split} \qquad (3)$$

Similarly, the symmetry of the high-pass filter coefficients gives the following outputs,

$$\begin{split} y_{H0} &= g_1(0+0) + g_2(0+x_0) + g_3(0+x_1) + g_4(0+x_2) \\ y_{H1} &= g_1(0+x_1) + g_2(x_0+x_2) + g_3(0+x_3) + g_4(0+x_4) \\ y_{H2} &= g_1(0+x_3) + g_2(x_2+x_4) + g_3(x_1+x_5) + g_4(x_0+x_6) \\ &\;\;\vdots \\ y_{H\frac{N}{2}-2} &= g_1(0+x_{N-5}) + g_2(x_{N-6}+x_{N-4}) + g_3(x_{N-7}+x_{N-3}) + g_4(x_{N-8}+x_{N-2}) \\ y_{H\frac{N}{2}-1} &= g_1(0+x_{N-3}) + g_2(x_{N-4}+x_{N-2}) + g_3(x_{N-5}+x_{N-1}) + g_4(x_{N-6}+0) \end{split} \qquad (4)$$
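The pattern of equations (3) and (4) can be written as a short software model in which mirrored input samples are added before a single multiplication by the shared coefficient. The function names, the integer-valued coefficients, and the test signal are hypothetical and chosen only so that the results are easy to check by hand; samples outside the signal are taken as zero, as in the equations.

```python
def sample(x, i):
    """Return x[i], or 0 outside the signal (zero padding as in equations (3) and (4))."""
    return x[i] if 0 <= i < len(x) else 0.0


def folded_filter(x, half, centre_of):
    """Symmetric filter: add the two mirrored samples first, then multiply once."""
    out = []
    for n in range(len(x) // 2):
        c = centre_of(n)
        acc = half[0] * sample(x, c)                       # centre tap, single sample
        for k in range(1, len(half)):
            acc += half[k] * (sample(x, c - k) + sample(x, c + k))
        out.append(acc)
    return out


def folded_lowpass(x, h_half):
    """h_half = [h0, h1, h2, h3, h4]; output n is centred on x[2n]."""
    return folded_filter(x, h_half, lambda n: 2 * n)


def folded_highpass(x, g_half):
    """g_half = [g1, g2, g3, g4]; output n is centred on x[2n - 1]."""
    return folded_filter(x, g_half, lambda n: 2 * n - 1)


# Hypothetical integer-valued coefficients and an 8-sample input (assumed values).
h_half = [6.0, 4.0, 3.0, 2.0, 1.0]
g_half = [5.0, -3.0, 2.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

y_l = folded_lowpass(x, h_half)
y_h = folded_highpass(x, g_half)
print(y_l)  # [36.0, 79.0, 121.0, 124.0]
print(y_h)  # [-2.0, 1.0, 4.0, 15.0]
```

In this form each low-pass output needs 5 multiplications instead of 9, and each high-pass output needs 4 instead of 7.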



Figure 6 The proposed Fast Convolution 1D (9,7) DWT Architecture

Furthermore, the outputs $y_L$ and $y_H$ are obtained alternately at the trailing edges of even and odd clock cycles [8-10].
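The alternation between the two outputs can be modelled at cycle level with a single shared multiply-accumulate routine whose coefficient set is switched every clock cycle. This is a behavioural sketch under stated assumptions: the low-pass output is assigned to even cycles and the high-pass output to odd cycles, the high-pass set is padded with a zero so both sets use the same five multipliers, and pipeline latency is ignored. All names and values are hypothetical.

```python
def sample(x, i):
    """Return x[i], or 0 outside the signal (zero padding)."""
    return x[i] if 0 <= i < len(x) else 0.0


def shared_datapath(x, h_half, g_half):
    """One set of multipliers; the coefficient registers alternate between LPF and HPF."""
    g_padded = list(g_half) + [0.0] * (len(h_half) - len(g_half))
    outputs = []
    for cycle in range(len(x)):
        n = cycle // 2
        if cycle % 2 == 0:                      # assumed: even cycle loads LPF coefficients
            coeffs, centre, band = h_half, 2 * n, "L"
        else:                                   # assumed: odd cycle loads HPF coefficients
            coeffs, centre, band = g_padded, 2 * n - 1, "H"
        acc = coeffs[0] * sample(x, centre)
        for k in range(1, len(coeffs)):
            acc += coeffs[k] * (sample(x, centre - k) + sample(x, centre + k))
        outputs.append((cycle, band, acc))
    return outputs


# Hypothetical integer-valued coefficients and input (assumed values).
h_half = [6.0, 4.0, 3.0, 2.0, 1.0]
g_half = [5.0, -3.0, 2.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

for cycle, band, value in shared_datapath(x, h_half, g_half):
    print(cycle, band, value)
```

With this schedule an input of $N$ samples yields $N/2$ low-pass and $N/2$ high-pass outputs in $N$ cycles, that is, one output per clock cycle from one set of multipliers.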

#### TABLE I. TIMING AND FREQUENCY ANALYSIS

| Timing Details           | Convolution Architecture | Proposed Architecture |
|--------------------------|--------------------------|-----------------------|
| Minimum period           | 10.765 ns                | 4.343 ns              |
| Maximum frequency        | 92.894 MHz               | 230.256 MHz           |
| Combinational path delay | 16.651 ns                | 7.059 ns              |
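The frequency figures in Table I follow directly from the minimum clock periods, as the short check below shows. The variable names are illustrative; the numbers are those reported in the table.

```python
def max_frequency_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency in MHz for a minimum period given in ns."""
    return 1000.0 / min_period_ns


conv_period_ns, prop_period_ns = 10.765, 4.343   # minimum periods from Table I
conv_delay_ns, prop_delay_ns = 16.651, 7.059     # combinational path delays from Table I

conv_f = max_frequency_mhz(conv_period_ns)
prop_f = max_frequency_mhz(prop_period_ns)
clock_speedup = conv_period_ns / prop_period_ns
delay_ratio = conv_delay_ns / prop_delay_ns

print(f"{conv_f:.3f} MHz, {prop_f:.3f} MHz")  # 92.894 MHz, 230.256 MHz
print(f"clock speed-up: {clock_speedup:.2f}x, path delay ratio: {delay_ratio:.2f}x")
```

The proposed architecture thus clocks roughly 2.5 times faster than the convolution architecture on the reported figures.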

#### TABLE II. DYNAMIC POWER ANALYSIS

| On Chip | Convolution Architecture (mW) | Proposed Architecture (mW) |
|---------|-------------------------------|----------------------------|
| Clocks  | 1.75                          | 1.41                       |
| Logic   | 1.80                          | 1.10                       |
| Signals | 2.29                          | 0.88                       |
| I/Os    | 0.63                          | 0.16                       |
| TOTAL   | 6.46                          | 3.56                       |
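The relative saving in dynamic power can be computed from the Table II totals, as sketched below. The dictionary layout is illustrative; the numbers are those reported in the table.

```python
# Per-component dynamic power in mW, as reported in Table II.
convolution = {"Clocks": 1.75, "Logic": 1.80, "Signals": 2.29, "I/Os": 0.63}
proposed = {"Clocks": 1.41, "Logic": 1.10, "Signals": 0.88, "I/Os": 0.16}
reported_total = {"convolution": 6.46, "proposed": 3.56}

conv_sum = sum(convolution.values())
prop_sum = sum(proposed.values())

# Relative reduction based on the reported totals.
reduction_pct = 100 * (reported_total["convolution"] - reported_total["proposed"]) \
    / reported_total["convolution"]

print(f"component sums: {conv_sum:.2f} mW, {prop_sum:.2f} mW")
print(f"dynamic power reduction: {reduction_pct:.1f} %")
```

The component sums agree with the reported totals to within 0.01 mW, which is consistent with rounding of the individual entries, and the reported totals correspond to a reduction of about 45 %.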

# 5. Simulation Results

The MATLAB simulation results of the 1D discrete wavelet transform on $256 \times 256$ grayscale images of "Baboon" and "Camera man" are illustrated in Figure 7. The figure shows the approximation (average) image and the horizontal, vertical and diagonal details of the "Baboon" and "Camera man" images after one-dimensional filtering. Progressive transmission of images is one of the main advantages of the discrete wavelet transform.



Figure 7 MATLAB Output Images (panel (a): input image)

# 6. Conclusion

In this paper, we have proposed a parallel architecture for very high-speed DWT computation using the fast convolution method. The fast-convolution-based architecture produces one output in every clock cycle while reducing the dynamic power as well as the critical path. In this approach, the system starts the column processing as soon as a sufficient number of rows have been filtered.

#### References

[1] Mallat, S.: "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation" IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7. (1989) 674-693

[2] Gnavi, S., Penna, B., Grangetto, M., Magli, E., Olmo, G.: "Wavelet kernels on a DSP: A comparison between lifting and filter banks for image coding". Applied Signal Processing: Special Issue on *Implementation of DSP and Communication Systems*. Vol. 2002. No. 9. (2002) 981-989

[3] Acharya, T.: "Architecture for Computing a Two-Dimensional Discrete Wavelet Transform". US Patent 6178269. (2001)

[4] Acharya, T., Chen, P.: "VLSI Implementation of a DWT Architecture". Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). Monterey, CA. (1998)

[5] B.-F.Wu and C.-F. Lin, "An efficient architecture for JPEG2000 coprocessor," IEEE Trans. Consum. Electron, vol. 50, no. 4, pp. 1183–1189, Nov. 2004

[6] Andra, K., Chakrabarti, C., Acharya, T.: "A VLSI Architecture for Lifting-Based Forward and Inverse Wavelet Transform". IEEE Transactions on Signal Processing, vol. 50. No. 4. (2002) 966-977

[7] Daubechies, I., Sweldens, W.: "Factoring wavelet transforms into lifting schemes". The Journal of Fourier Analysis and Applications, vol. 4. (1998) 247-269

[8] Gaurav Tewari, Santu Sardar, K. A. Babu, "High-Speed & Memory Efficient 2-D DWT on Xilinx Spartan3A DSP using scalable Polyphase Structure with DA for JPEG2000 Standard," IEEE, 2011

[9] Q. P. Huang, R. Z. Zhou, and Z. L. Hong, "Low memory and low complexity VLSI implementation of JPEG2000 codec," *IEEE Trans. Consum. Electron.*, vol. 50, no. 2, pp. 638-646, May 2004

[10] K. Z. Mei, N. N. Zheng, C. Huang, Y. Liu, and Q. Zeng, "VLSI Design of a High-Speed and Area-Efficient JPEG2000 Encoder," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 8, pp. 1065-1078, Aug. 2007.