- Open Access
- Authors : Kasa Srinivasulu, K.Satya Sujith, C.Laxmana Sudheer, Y.Chamundeswari
- Paper ID : IJERTV2IS90630
- Volume & Issue : Volume 02, Issue 09 (September 2013)
- Published (First Online): 19-09-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Multiplier Less ROM Free DA based DCT
Kasa Srinivasulu (1), K. Satya Sujith, I.S.T.E. (2), C. Laxmana Sudheer, M.Tech (3), Y. Chamundeswari, M.Tech (4)
(1) Associate Professor; (2, 3, 4) Assistant Professor
(1) Brahmaiah College of Engg.; (2) Audisankara Institute of Technology
Abstract: The discrete cosine transform (DCT) is widely used in image and video compression. High-throughput DCT designs have recently been adopted to meet the requirements of real-time applications. This paper proposes a distributed arithmetic (DA) based DCT design with an optimized adder tree (OAT), which performs shifting and addition in parallel by unrolling all the words to be computed. An error-compensated circuit alleviates the truncation error, giving a low-error, high-speed design. Because the OAT keeps the error low, the DA precision is chosen to be 9 bits instead of the 12 bits used in previous works. As a result, hardware size and cost are reduced and speed is improved.
Keywords: Adders, discrete cosine transform (DCT), distributed arithmetic (DA), optimized adder tree (OAT).
I. INTRODUCTION
Digital networks and the digital representation of images, movies, video, TV, voice, and libraries are now commonplace, because a digital representation of a signal is more robust than its analog counterpart for processing, manipulation, storage, recovery, and transmission over long distances, even across the globe through communication networks. In recent years there have been significant advances in the processing of still images, video, graphics, speech, and audio signals on digital computers to meet different application challenges. As a result, multimedia information comprising image, video, audio, speech, text, and other data types has the potential to become just another data type. The development of efficient image compression techniques therefore continues to be an important challenge, both in academia and in industry [1].
Multiplier-based DCTs were implemented in [2]; later, to reduce area, ROM-based DA was applied to DCT design [3]. DA-based multipliers then used ROMs to produce partial products, together with adders that accumulate these partial products, so applying ROM-based DA to a DCT core reduces the required area. In addition, the symmetry properties of the DCT and a parallel DA architecture can be used to reduce the ROM size [4]. Recently, ROM-free DA architectures were presented [6]-[11]. Shams et al. employed a bit-level sharing scheme to construct an adder-based butterfly matrix called new DA (NEDA) [7]. After compression, the butterfly-adder matrix in [7] uses 35 adders and 8 shift-addition elements to replace the ROM. Based on the NEDA architecture, the recursive form and an ALU were applied in DCT design to reduce area cost [8], [9], but the serial shifting and addition that follows the DA computation limits the speed. In [10] and [11], the partial-product words of the DA-based computation are shifted and added in parallel; however, a large truncation error occurs.
A truncation error is introduced if the least significant part is truncated directly, and this error needs to be reduced. To reduce its effect, several error-compensation bias methods have been presented, based on a statistical analysis of the relationship between the partial products and the multiplier and multiplicand. Minimizing the truncation error also reduces hardware complexity, because a shorter word length can then reach the same accuracy. In general, the truncation part (TP) is truncated to reduce hardware cost in parallel shifting and addition operations; this is known as the direct truncation (Direct-T) method. A large truncation error then occurs because the carry propagation from the TP to the main part (MP) is neglected.
Distributed arithmetic is a bit-level rearrangement of a multiply-accumulate operation that hides the multiplications. It is a powerful technique for reducing the size of a parallel hardware multiply-accumulate unit and is well suited to FPGA designs. The DCT is widely used in digital image processing for image compression, especially in image transform coding. However, although most DCT algorithms are good software solutions, only a few are really suitable for VLSI implementation. Cyclic convolution plays an important role in digital signal processing because it is easy to implement. Specifically, a number of well-developed convolution algorithms exist, and they can be realized easily with modular, structured hardware such as distributed arithmetic and systolic arrays.
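The effect of direct truncation can be illustrated numerically. The following is a minimal sketch, not the authors' circuit: it assumes unsigned P-bit words, where word j carries the weight 2^-j, and uses (P, Q) = (12, 6) as in the paper's adder-tree example. The function names are hypothetical. It compares adding at full precision and then dropping the TP against dropping the TP of every shifted word before the addition.

```python
import random


def exact_then_truncate(words):
    """Add all shifted words at full precision, then drop the truncation part (TP)."""
    q = len(words)
    # Scale by 2**(q-1) so that word j (weight 2**-j) becomes an integer left shift.
    full = sum(w << (q - 1 - j) for j, w in enumerate(words))
    return full >> (q - 1)


def direct_truncation(words):
    """Direct-T: drop the TP bits of every shifted word before the addition."""
    return sum(w >> j for j, w in enumerate(words))


random.seed(0)
P, Q = 12, 6  # assumed word length and number of words
worst = 0
for _ in range(10000):
    words = [random.randrange(1 << P) for _ in range(Q)]
    err = exact_then_truncate(words) - direct_truncation(words)
    worst = max(worst, err)  # err counts the carries lost from the TP

print("largest Direct-T error in output LSBs:", worst)
```

Because the lost carries can only make the Direct-T result too small, the error is one-sided, which is why a compensation bias can reduce it.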
II. MATHEMATICAL DERIVATION OF DISTRIBUTED ARITHMETIC
The inner product is an important tool in digital signal processing applications. It can be written as follows:

$$Y = A^{T} X = \sum_{i=1}^{L} A_i X_i \qquad (1)$$

where $A_i$, $X_i$, and $L$ are the $i$th fixed coefficient, the $i$th input data, and the number of inputs, respectively. Assume that coefficient $A_i$ is a $Q$-bit two's complement binary fraction number. Equation (1) can then be expressed as follows:
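$$
Y = \begin{bmatrix} 2^{0} & 2^{-1} & 2^{-2} & \cdots & 2^{-(Q-1)} \end{bmatrix}
\begin{bmatrix}
A_{1,0} & A_{2,0} & \cdots & A_{L,0} \\
A_{1,1} & A_{2,1} & \cdots & A_{L,1} \\
\vdots & \vdots & \ddots & \vdots \\
A_{1,(Q-1)} & A_{2,(Q-1)} & \cdots & A_{L,(Q-1)}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix}
= \begin{bmatrix} 2^{0} & 2^{-1} & 2^{-2} & \cdots & 2^{-(Q-1)} \end{bmatrix}
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{Q-1} \end{bmatrix}
\qquad (2)
$$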
where $y_j = \sum_{i=1}^{L} A_{i,j} X_i$, with $A_{i,j} \in \{0, 1\}$ for $j = 1, \ldots, Q-1$ and $A_{i,0} \in \{-1, 0\}$. Note that $y_0$ may be 0 or a negative number because of the two's complement representation. In (2), each $y_j$ can be calculated by adding all $X_i$ values for which $A_{i,j} = 1$ (for $y_0$, the $X_i$ with $A_{i,0} = -1$ enter with a negative sign), and the transform output $Y$ is then obtained by shifting and adding all nonzero $y_j$ values. Thus the inner product in (1) can be implemented with shifters and adders instead of multipliers, so a DA-based architecture achieves low hardware cost.
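The bit-level rearrangement in (2) can be sketched in a few lines. This is an illustrative model under stated assumptions, not the paper's hardware: the function names, the rounding-based quantization, and the example coefficients are hypothetical, and floating-point scaling by 2^-j stands in for a hardware right shift.

```python
def twos_complement_digits(value, q):
    """Quantize value in [-1, 1) to Q bits and return the digits A_{i,j}.

    Digit 0 is the sign digit and takes values in {-1, 0}; digits 1..Q-1
    take values in {0, 1}. Digit j carries the weight 2**-j.
    """
    half = 1 << (q - 1)
    scaled = max(-half, min(half - 1, round(value * half)))
    pattern = scaled & ((1 << q) - 1)           # two's complement bit pattern
    bits = [(pattern >> (q - 1 - j)) & 1 for j in range(q)]
    return [-bits[0]] + bits[1:]


def da_inner_product(coeffs, x, q):
    """Evaluate Y = sum_i A_i X_i as in (2), without coefficient multipliers."""
    digits = [twos_complement_digits(a, q) for a in coeffs]
    y = 0.0
    for j in range(q):
        # d[j] is -1, 0, or 1, so this is a signed selection of X_i, not a multiply.
        y_j = sum(d[j] * xi for d, xi in zip(digits, x))
        y += y_j * 2.0 ** -j                    # a right shift by j in hardware
    return y


coeffs = [0.5, -0.25, 0.75]     # exactly representable with Q = 8
x = [10, 20, -4]
print(da_inner_product(coeffs, x, q=8))             # -3.0
print(sum(a * xi for a, xi in zip(coeffs, x)))      # -3.0 (direct multiply-accumulate)
```

For coefficients that are not exactly representable in Q bits, the two results differ by the quantization error, which is why the choice of DA precision matters.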
III. PROPOSED OPTIMIZED ADDER TREE ARCHITECTURE
In general, the shifting and addition computation uses a shift-and-add operator in VLSI implementations in order to reduce hardware cost. However, as the number of words to be shifted and added increases, the computation time also increases. Therefore, the previously presented shift-adder tree (SAT) performs shifting and addition in parallel by unrolling all the words to be computed, for high-speed applications. However, a large truncation error occurs in the SAT, and an optimized adder tree architecture is proposed in this brief to compensate for the truncation error in high-speed applications.
Fig. 1. Q P-bit words shifted and added in parallel.
In Fig. 1, Q words of P bits each are shifted and added in parallel by unrolling all computations. The operation in Fig. 1 can be divided into two parts: the main part (MP), which holds the most significant bits (MSBs), and the truncation part (TP), which holds the least significant bits (LSBs). A large truncation error occurs when the carry propagation from the TP to the MP is neglected.
The proposed optimized adder tree architecture is illustrated in Fig. 2 for (P, Q) = (12, 6), where block FA indicates a full-adder cell with three inputs (a, b, and c) and two outputs, a sum (s) and a carry-out (co), and block HA indicates a half-adder cell with two inputs (a and b) and two outputs, a sum (s) and a carry-out (co).
Fig. 2. Proposed optimized adder tree architecture.
IV. PROPOSED 8×8 2-D DCT DESIGN
The 1-D DCT employs the DA-based architecture and the proposed optimized adder tree to achieve a high-speed, small-area, and low-error design. The 1-D 8-point DCT can be expressed as follows:

$$Z_n = \frac{1}{2} K_n \sum_{m=0}^{7} x_m \cos\left(\frac{(2m+1)\,n\pi}{16}\right)$$

where $x_m$ denotes the input data, $Z_n$ denotes the transform output, $0 \le n \le 7$, and $K_n = 1/\sqrt{2}$ for $n = 0$ and $K_n = 1$ otherwise. By neglecting the scaling factor 1/2, the 1-D 8-point DCT above can be divided into even and odd parts, $Z_e$ and $Z_o$, as listed in the following equations:

$$
Z_e = \begin{bmatrix} Z_0 \\ Z_2 \\ Z_4 \\ Z_6 \end{bmatrix}
= \begin{bmatrix}
C_4 & C_4 & C_4 & C_4 \\
C_2 & C_6 & -C_6 & -C_2 \\
C_4 & -C_4 & -C_4 & C_4 \\
C_6 & -C_2 & C_2 & -C_6
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
$$

$$
Z_o = \begin{bmatrix} Z_1 \\ Z_3 \\ Z_5 \\ Z_7 \end{bmatrix}
= \begin{bmatrix}
C_1 & C_3 & C_5 & C_7 \\
C_3 & -C_7 & -C_1 & -C_5 \\
C_5 & -C_1 & C_7 & C_3 \\
C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
$$

where $C_i = \cos(i\pi/16)$, $a_k = x_k + x_{7-k}$, and $b_k = x_k - x_{7-k}$ for $k = 0, 1, 2, 3$.
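The even/odd split of the 8-point DCT can be checked numerically against the direct summation. The sketch below assumes the usual butterfly inputs a_k = x_k + x_(7-k) and b_k = x_k - x_(7-k), keeps the factor K_n, and omits the scaling factor 1/2 as the text does. The function names and the test vector are hypothetical.

```python
import math

C = [math.cos(i * math.pi / 16) for i in range(8)]  # C[i] = cos(i*pi/16)

EVEN = [[C[4],  C[4],  C[4],  C[4]],    # Z0
        [C[2],  C[6], -C[6], -C[2]],    # Z2
        [C[4], -C[4], -C[4],  C[4]],    # Z4
        [C[6], -C[2],  C[2], -C[6]]]    # Z6
ODD = [[C[1],  C[3],  C[5],  C[7]],     # Z1
       [C[3], -C[7], -C[1], -C[5]],     # Z3
       [C[5], -C[1],  C[7],  C[3]],     # Z5
       [C[7], -C[5],  C[3], -C[1]]]     # Z7


def dct8_direct(x):
    """Z_n = K_n * sum_m x_m cos((2m+1) n pi / 16), scaling factor 1/2 omitted."""
    out = []
    for n in range(8):
        k_n = 1 / math.sqrt(2) if n == 0 else 1.0
        out.append(k_n * sum(x[m] * math.cos((2 * m + 1) * n * math.pi / 16)
                             for m in range(8)))
    return out


def dct8_even_odd(x):
    """Same transform via the butterfly inputs and two 4x4 matrices."""
    a = [x[k] + x[7 - k] for k in range(4)]
    b = [x[k] - x[7 - k] for k in range(4)]
    z = [0.0] * 8
    z[0::2] = [sum(row[k] * a[k] for k in range(4)) for row in EVEN]
    z[1::2] = [sum(row[k] * b[k] for k in range(4)) for row in ODD]
    return z


x = [1, 2, 3, 4, 5, 6, 7, 8]
max_diff = max(abs(p - q) for p, q in zip(dct8_direct(x), dct8_even_odd(x)))
print("largest difference between the two forms:", max_diff)
```

The split halves the number of coefficient products per output from eight to four, which is what makes the separate even and odd processing elements worthwhile.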
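Table 1 gives the bit-level formulation of Z1 and Z4 for the first six bit weights. Consider the evaluation of Z4.

| Weight | Z1 value | Z4 value |
| --- | --- | --- |
| -2^0 | 0 | A1 |
| 2^-1 | B0+B1+B2 | A0 |
| 2^-2 | B0+B1 | A1 |
| 2^-3 | B0+B3 | A0 |
| 2^-4 | B0+B1+B3 | A0 |
| 2^-5 | B0+B2 | A1 |

TABLE 1: Bit-level formulation of Z1 and Z4. In the Z1 column, B0 to B3 denote the odd-part inputs b0 to b3: each row lists the inputs whose coefficient (C1, C3, C5, or C7, respectively) has a 1 at that bit weight. They are distinct from the even-part differences B0 and B1 defined after the table.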
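The Z4 column of Table 1 can be reproduced from the two's complement expansion of the coefficient. This is a sketch under assumptions: it takes Z4 = C4·A0 - C4·A1 with A0 = a0 + a3 and A1 = a1 + a2, quantizes +C4 and -C4 by rounding to Q = 9 bits, and uses hypothetical function names.

```python
import math

Q = 9  # DA precision used in the paper


def twos_complement_bits(value, q):
    """Return the Q-bit two's complement bits of value; bit 0 is the sign bit."""
    pattern = round(value * (1 << (q - 1))) & ((1 << q) - 1)
    return [(pattern >> (q - 1 - j)) & 1 for j in range(q)]


C4 = math.cos(4 * math.pi / 16)
plus_c4 = twos_complement_bits(C4, Q)     # bits that select A0
minus_c4 = twos_complement_bits(-C4, Q)   # bits that select A1


def table_row(j):
    """Inputs that are added at bit position j of Z4 = C4*A0 - C4*A1."""
    terms = [name for name, bits in (("A0", plus_c4), ("A1", minus_c4)) if bits[j]]
    return "+".join(terms) or "0"


def z4_bit_level(a0, a1):
    """Shift-and-add evaluation of Z4; the sign bit carries the weight -2**0."""
    total = 0.0
    for j in range(Q):
        y_j = plus_c4[j] * a0 + minus_c4[j] * a1
        total += (-1.0 if j == 0 else 2.0 ** -j) * y_j
    return total


rows = [table_row(j) for j in range(6)]
print(rows)                                   # ['A1', 'A0', 'A1', 'A0', 'A0', 'A1']
print(z4_bit_level(10, 3), C4 * (10 - 3))     # quantized vs. exact value
```

The first six rows match the Z4 column of the table; the remaining three bit positions of the 9-bit expansion are evaluated the same way.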
Here the even-part sums and differences are
A0 = (X0 + X7) + (X4 + X3) = a0 + a3,
A1 = (X1 + X6) + (X2 + X5) = a1 + a2,
B0 = (X0 + X7) - (X4 + X3) = a0 - a3,
B1 = (X1 + X6) - (X2 + X5) = a1 - a2.
Given the input data A0 and A1, the transform output Z0 needs only one adder to compute (A0 + A1), and two separate optimized adder trees produce the results of Z0 and Z4. Similarly, the other transform outputs can be implemented in DA-based form using 10 (= 1 + 9) adders and the corresponding optimized adder trees. Consequently, the proposed 1-D 8-point DCT architecture can be constructed as illustrated in Fig. 3 using a DA-Butterfly-Matrix, which includes two DA even processing elements (DAEs), a DA odd processing element (DAO), and 12 adders/subtractors, together with 8 optimized adder trees (one for each transform output Zn). The eight separate optimized adder trees work simultaneously, enabling high-speed operation. Once the data output from the DA-Butterfly-Matrix is complete, the proposed optimized adder trees complete the transform output Z within one clock cycle. In contrast, the traditional shift-and-add architecture requires Q clock cycles to complete the transform output Z when the DA precision is Q bits (here, 9 bits).
Fig. 3. Architecture of the proposed 1-D 8-point DCT.
With high-speed considerations in mind, the proposed 2-D DCT is designed using two 1-D DCT cores and one transpose buffer. For accuracy, the DA precision and the transpose buffer word length are chosen to be 9 bits and 12 bits, respectively, so that the system can meet the PSNR requirements outlined in previous works. Moreover, the 2-D DCT core accepts 9-bit image input and produces output with 12-bit precision.
V. RESULTS AND DISCUSSIONS
The test image Lena used to check system accuracy comprises 256 × 256 pixels, with each pixel represented by 8-bit, 256-gray-level data. After the original test image pixels are input to the proposed 2-D DCT core, the transform output data is captured and fed into MATLAB to compute the inverse DCT using 64-bit double-precision operations. The proposed DCT core has the highest hardware efficiency, defined as follows (based on the accuracy required by the presented standards):

$$\text{hardware efficiency} = \frac{\text{throughput rate}}{\text{gate count}}$$

Furthermore, the proposed 2-D DCT core, synthesized using Xilinx ISE 10.1, simulated using ModelSim 6.4d, and targeted at the Xilinx XC2VP30 FPGA, can achieve a throughput rate of 1067 megapixels per second (M-pels/s), which is seven times that of the previous work in [16].
Fig. 4. Simulation result of DA-DCT.
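The accuracy check described in this section (forward 2-D DCT in the core, inverse DCT in double precision) is commonly scored with the PSNR. The sketch below is a software stand-in under stated assumptions, not the authors' test bench: it builds the 2-D DCT from two 1-D passes and a transpose, rounds each coefficient to an integer to mimic a finite output word, and uses a synthetic 8×8 block instead of the Lena image. All names are hypothetical.

```python
import math

N = 8


def k(n):
    return 1 / math.sqrt(2) if n == 0 else 1.0


def dct_1d(v):
    """Orthonormal 8-point DCT: Z_n = (1/2) K_n sum_m x_m cos((2m+1) n pi / 16)."""
    return [0.5 * k(n) * sum(v[m] * math.cos((2 * m + 1) * n * math.pi / 16)
                             for m in range(N)) for n in range(N)]


def idct_1d(z):
    """Inverse of dct_1d, evaluated in double precision."""
    return [0.5 * sum(k(n) * z[n] * math.cos((2 * m + 1) * n * math.pi / 16)
                      for n in range(N)) for m in range(N)]


def transpose(block):
    return [list(row) for row in zip(*block)]


def apply_2d(block, transform):
    """Row pass, transpose, second pass, transpose back."""
    rows = [transform(r) for r in block]
    cols = [transform(c) for c in transpose(rows)]
    return transpose(cols)


def psnr(ref, test, peak=255.0):
    mse = sum((r - t) ** 2 for rr, tr in zip(ref, test)
              for r, t in zip(rr, tr)) / (N * N)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


block = [[(17 * r + 31 * c) % 256 for c in range(N)] for r in range(N)]  # synthetic pixels
coeffs = apply_2d(block, dct_1d)
quantized = [[round(z) for z in row] for row in coeffs]   # assumed finite output precision
recon = apply_2d(quantized, idct_1d)
print(f"PSNR = {psnr(block, recon):.1f} dB")
```

In a real evaluation, `quantized` would be replaced by the coefficients captured from the simulated core, and the PSNR would be accumulated over all blocks of the test image.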
VI. CONCLUSION
This paper contributes specific simplifications in the multiplier stage by using the shift-and-add method, which leads to simpler hardware and a speed-up of the overall architecture. The proposed 8×8 2-D DCT core has a latency of 10 clock cycles and operates at 125 MHz. Because of its 8 parallel outputs, the proposed 2-D DCT core can achieve a throughput rate of 1 Gpixel/s (8 × 125 MHz), meeting the 1080p (1920 × 1080 × 60 pixels/s) high-definition television (HDTV) specifications for 200 MHz based on low-power operation. The maximum throughput rate is 1 Gpixel/s. Thus, the proposed architecture is suitable for high-compression-rate applications in VLSI designs.
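The throughput figures quoted above follow from simple arithmetic, shown here as a short check. It assumes one output pixel per parallel output per clock cycle; the variable names are hypothetical.

```python
clock_hz = 125e6                 # operating frequency of the core
outputs_per_cycle = 8            # parallel outputs
throughput = clock_hz * outputs_per_cycle     # pixels per second
hdtv_1080p60 = 1920 * 1080 * 60               # pixels per second required by 1080p at 60 fps

print(f"core throughput : {throughput / 1e6:.0f} Mpixels/s")    # 1000 Mpixels/s
print(f"1080p at 60 fps : {hdtv_1080p60 / 1e6:.1f} Mpixels/s")  # 124.4 Mpixels/s
print(f"headroom        : {throughput / hdtv_1080p60:.1f}x")    # 8.0x
```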
REFERENCES
[1] Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[2] Y. Chang and C. Wang, "New systolic array implementation of the 2-D discrete cosine transform and its inverse," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 2, pp. 150-157, Apr. 1995.
[3] C. T. Lin, Y. C. Yu, and L. D. Van, "Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058-1071, Aug. 2008.
[4] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Yerane, and M. Yoshimoto, "A 100-MHz 2-D discrete cosine transform core processor," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 492-499, Apr. 1992.
[5] S. Yu and E. E. Swartzlander, Jr., "DCT implementation with distributed arithmetic," IEEE Trans. Comput., vol. 50, no. 9, pp. 985-991, Sep. 2001.
[6] P. K. Meher, "Unified systolic-like architecture for DCT and DST using distributed arithmetic," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 12, pp. 2656-2663, Dec. 2006.
[7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955-964, Mar. 2006.
[8] M. R. M. Rizk and M. Ammar, "Low power small area high performance 2D-DCT architecture," in Proc. Int. Design Test Workshop, 2007, pp. 120-125.
[9] Y. Chen, X. Cao, Q. Xie, and C. Peng, "An area efficient high performance DCT distributed architecture for video compression," in Proc. Int. Conf. Adv. Comm. Technol., 2007, pp. 238-241.
[10] C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8×8 DCT," in Proc. Int. Conf. ASIC, 2007, pp. 189-192.
[11] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21-24.
[12] S. S. Kidambi, F. E. Guibaly, and A. Antonious, "Area-efficient multipliers for digital signal processing applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 43, no. 2, pp. 90-95, Feb. 1996.
[13] K. J. Cho, K. C. Lee, J. G. Chung, and K. K. Parhi, "Design of low-error fixed-width modified Booth multiplier," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 522-531, May 2004.
[14] L. D. Van and C. C. Yang, "Generalized low-error area-efficient fixed-width multipliers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1608-1619, Aug. 2005.
[15] C. C. Sun, P. Donner, and J. Gotze, "Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform," in Proc. IEEE Int. Symp. Circuits Syst., 2009, pp. 3014-3017.
[16] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A pipelined fast 2D-DCT accelerator for FPGA-based SoCs," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2007, pp. 331-336.
[17] S. Ghosh, S. Venigalla, and M. Bayoumi, "Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2005, pp. 162-166.
[18] Yuan-Ho Chen, Tsin-Yuan Chang, and Chung-Yi Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 4, Apr. 2011.