Quad-Byte Transformation as a Pre-Processing to Arithmetic Coding

Jyotika Doshi; Savita Gandhi

doi:10.17577/IJERTV2IS120938

Volume 02, Issue 12 (December 2013)

Quad-Byte Transformation as a Pre-Processing to Arithmetic Coding

DOI : 10.17577/IJERTV2IS120938

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 247
Total Downloads : 183
Authors : Jyotika Doshi, Savita Gandhi
Paper ID : IJERTV2IS120938
Volume & Issue : Volume 02, Issue 12 (December 2013)
Published (First Online): 24-12-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Quad-Byte Transformation as a Pre-Processing to Arithmetic Coding

Jyotika Doshi

GLS Inst.of Computer Technology Opp. Law Garden, Ellisbridge Ahmedabad-380006, INDIA

Savita Gandhi

Dept. of Computer Science; Gujarat University Navrangpura Ahmedabad-380009, INDIA

Abstract

Transformation algorithms are an interesting class of data-compression techniques in which one can perform reversible transformations on datasets to increase their susceptibility to other compression techniques. In recent days, due to its optimal entropy, arithmetic coding is the most widely preferred entropy encoder in most of the compression methods. In this paper, we have proposed QBT-I (Quad-Byte Transformation using Indexes) method to be used as a pre-processing stage before applying arithmetic coding compression method. QBT-I is intended to introduce more redundancy in the data and make it more compressible using arithmetic coding. QBT-I transforms most frequent quad-bytes; i.e.4-byte integers. Dictionary of frequent quad-bytes is divided in a group of 256 quad-bytes. Each quad- byte in the dictionary is encoded using two tokens: group number and the location in a group. Group number is denoted using variable length codeword; whereas location within a group is denoted using 8-bit index. QBT-I can be applied on any source; not necessarily text or image or audio. QBT-I is expected to be faster due to 32-bit integer comparison which is faster than 4-byte pattern matching. QBT-I may also compress data along with transformation.

Introduction

Data transformation means transforming data from one format to another. When data transformation is applied as a pre-stage to conventional compression, the main purpose of a data transformation is to re-structure the data such that the transformed file is more compressible by a

second-stage conventional compression algorithm. Thus, the intent is to use the paradigm to improve the overall compression ratio in comparison with what could have been achieved by using only the compression algorithm.

Arithmetic coding [8, 11, 30] is the most widely preferred efficient entropy coding technique providing optimal entropy. Most of the data compression algorithms like LZ algorithms [22, 28, 31, 32]; DMC (Dynamic Markov Compression) [2, 4]; PPM [15] and their variants such as PPMC, PPMD and PPMD+ and others, context-tree weighting method [29], Grammarbased codes [9] and many methods of image compression, audio and video compression transforms data first and then apply entropy coding in the last step. Earlier- generation image and video coding standards such as JPEG, H.263, and MPEG-2, MPEG-4 relied heavily on Huffman coding for the entropy coding steps in compression; but recent generation standards including JPEG2000 [5, 24] and H.264 [12, 26] utilize arithmetic coding.

With arithmetic coding, further improvement in compression is not possible due to entropy limitations. Better compression can be achieved if the data can be transformed to be more skewed.

Here, we have proposed Quad-Byte Transformation using Index (QBT-I) method with an intention to introduce more redundancy in the data and make it more compressible using arithmetic coding at the second stage as shown in figure 1. Both the transformation and compression algorithms are lossless.

QBT-I transforms most frequent quad-bytes (4- byte integer) using indexes from the dictionary of most frequent quad-bytes.

Figure 1Data Transformation before Compression

In general, due to two-stage process of transformation and then compression, compression process is somewhat slower in performance. However, this slowness in the run-time performance is acceptable since the transform will truly skew the data source to allow more effective compression. QBT-I can help to solve this speed problem as follows:
- Quad-byte transformation needs less number of transformations as compared to 1 to 3 byte transformations
- Integer comparison is faster as compared to 4- byte pattern matching
  
  Main advantage of QBT-I is that it is not intended to specific type of data. It can be applied to any source.
Literature Review

Most of the research work in data transformation is intended to compress specific type of files like text, image, audio etc. Transformation techniques like DCT and wavelet are used for image files. After transformation, entropy encoding is applied in the final stage. Following research work to transform data is intended to compress text files.
- Burrows Wheeler Transform (BWT) described in [3, 16] performs block encoding.
- Star family transformation encodes words of different length. According to Dictionary- Based Multi-Corpora Text Compression system by Weifeng Sun, Amar Mukherjee, Nan Zhang [23], to gain a much better compression performance for the backend data compression algorithm, only letters [a..z, A..Z] are used to present the codeword. Star Transform [10], Length Index Preserving Transform (LIPT) [1, 17], and StarNT [23] are some of the transformation techniques that fall in this category.
- Transformation using index position of words in dictionary is performed by Intelligent Dictionary Based Encoding (IDBE) [21], Enhance Intelligent Dictionary Based Encoding (EIDBE) [19] and Improved
  
  Intelligent Dictionary Based Encoding (IIDBE)
  [20] methods.
- Two-byte transformation is applied in BPE (Byte Pair Encoding) [7], digram encoding and ISSDC (Iterative Semi-Static Digram Coding) [14].
- LZ family of algorithms fall in the category of techniques known as Sliding Window based techniques.
  
  Even though above techniques are intended for text files, methods like BPE and digram encoding can be applied to any type of source. But, they will benefit more when applied to small-alphabet source like text files.
  
  Many of the present day transformation techniques, along with transforming data, may introduce some compression also. Additionally they retain enough context and redundancy for compression algorithms to be beneficial.
  Byte Pair Encoding (BPE) presented by P Gage
  [7] is a simple universal text compression scheme based on the 2-byte pattern-substituton.
  
  Byte pair encoding works by finding the most common pair of bytes in a file, and replacing all such pairs with a single unused byte. This substitution information is also stored with the compressed data. This process is repeated using the output of previous iteration as an input. BPE encoder stops when either no byte-pair occurs more than once or no unused characters are left.
  
  BPE decodes most frequent byte-pairs using unused symbols in the source file, thus using 8-bits per 16-bit byte-pair. If there are no zero-frequency (i.e. unused) symbols, it will not be beneficial.
Research Scope

Transformation methods Star encoding, LIPT, StarNT use letters of English alphabet (a..z, A..Z) in the codeword. Methods IDBE, EIDBE, IIDBE can exploit the unused symbols like ASCII values 129 to 255 in text source files for most frequent or longer text patterns. All these methods are using dictionary of words or patterns and requires better data structures and pattern matching algorithms for efficiency.

BWT is very slow due to the need of rotations, sorting and mapping. It gives good compression only when later applied sequence of MTF, RTF and entropy encoding.

Byte-pair encoding, digram encoding and its variations can be applied to any type of source, but they will be beneficial only for small-alphabet source files like text.

Digram encoding and its variation ISSDC give better compression only when the source alphabet is small. If all 256 1-byte symbols are used in the source, the dictionary size needs to be longer than

256 words and each 1-byte character will be encoded using more than 8 bits. ISSDC has an additional drawback of the entire source to be in memory due to its two-pass multi-iterations. So, it can be applied to small size source files.

BPE decodes most frequent byte-pair using unused symbol in the source file, thus using only 8- bits per 16-bit byte-pair. It is also a repetitive process, repeating till there are no unused symbols or no repeated symbols. Thus BPE is beneficial

only when the source is having some unused symbols in it.

As mentioned before, arithmetic coding is the widely used entropy encoding method used with most of the compression methods. So, we saw a research scope here to have transformation method that can be applied to any type source data and introduce redundancy to skew the distribution for getting better compression using arithmetic coding later. There is also a scope to decrease the transformation time.

Proposed Transformation Method QBT-I

Proposed method QBT-I transforms 4-byte integers. It has following advantages over existing methods:

It can be used with any type of source; not limited to text.

Data transformation is expected to be faster due to following:

No pattern matching algorithms are needed during transformation or inverse transformation. Integer comparison is faster to execute as compared to matching a pattern of 4 bytes.

4-byte transformation needs fewer transformations.

QBT-I transforms most frequent quad-bytes. It uses Index based transformation. QBT-I first prepares the dictionary of quad-bytes sorted in decreasing order of their occurrence. The dictionary is then logically divided into groups of 256 quad- bytes. Number of groups may vary and can be specified by a user. If number of groups is nGrp, then the dictionary size is to accommodate (256 x nGrp) quad-bytes.

Each quad-byte found in the dictionary is then encoded using two tokens; group number and the location of quad-byte within a group. Group number is denoted using variable length prefix codeword and location is denoted using 8-bit index. Quad-bytes in different groups may be at same location index. Thus, redundancy is introduced due to 8-bit index codeword denoting the location of quad-bytes. More the number of groups; more is the redundancy and better is the compression achieved using arithmetic coding later.

For decoder, it needs to know whether it has to reverse transform the quad-byte or not. Assuming the worst case of majority of integers not available in the dictionary, encoder uses group codeword 0 to notify that quad-byte integer is not transformed. As explained in Table 1, variable length prefix code starting with bit 1 denotes that an integer is found

in the dictionary and is encoded using the index position within a group.

Thus, a quad-byte integer is transformed using two components <group code, index code>.

Group code starting with 0 implies no transformation, with as many 1s as the number of groups implies the quad-byte from the last group and otherwise it implies quad-byte in other groups.

For quad-bytes found in dictionary, 8-bit index codeword will introduce redundancy in the dataset. To exploit redundancy at the time of compressing data using arithmetic coding, it is advisable to keep group code and index code separate in a file or in files.

Use of variable length code helps to reduce the size of transformed file. Here most frequent codes are in the initial groups and are assigned shorter prefix code. Shortest prefix code 0 is used for untransformed integers assuming smaller dictionary size. Smaller dictionary sizes will speedup the search process.

Table 1 Prefix Code and Index Code for Quad- byte found in Dictionary

Group 1 Prefix code: If last group, (1)2 If not last group, (10)2	Integer Data	D0	D1	D2	…	D255
	Location	0	1	2	…	255
	Index Codeword	0	1	2	…	255

Group 2 Prefix code: If last group, (11)2 If not last group, (110)2	Integer Data	D256	D257	D258	…	D511
	Location	256	257	258	…	511
	Index Codeword	0	1	2	…	255

Group 3 Prefix code: If last group, (111)2 If not last group, (1110)2	Integer Data	D512	D513	D514	…	D767
	Location	512	513	514	…	767
	Index Codeword	0	1	2	…	255
Â·Â·

Conclusion

In this paper, our hypothesis is that the proposed method QBT-I of quad-byte data transformation will give better compression when used at pre- processing stage with arithmetic coding and will take relatively less time due to the need of fewer transformations and use of integer comparison instead of pattern matching.
References

F. D. Awan, N. Zhang, N. Motgi, R. T. Iqbal, A. Mukherjee. LIPT: A reversible lossless text transform to improve compression performance, Proceedings of the IEEE Data Compression Conference (DCC2001), pp. 481, March 2729, 2001
T.C. Bell, A. Moffat, A Note on the DMC Data Compression Scheme, Computer Journal, vol. 32(1), pp.16-20, 1989
M. Burrows,D. J. Wheeler. A block-sorting lossless data compression algorithm, Digital Systems Research Center, Research Report 124, Digital Equipment Corporation, Palo Alto,

California, May 10, 1994
G.V. Cormack, R.N. Horspool, Data Comressing Using Dynamic Markov Modeling, Computer Journal, vol. 30(6), pp.541-550, 1987
M. Dyer,D. Taubman, S. Nooshabadi, Improved throughput arithmetic coder for JPEG2000, Proc. Int. Conf. Image Process., Singapore, pp. 2817 2820, Oct. 2004
R. Franceschini, H. Kruse, N. Zhang, R. Iqbal, A. Mukherjee, Lossless, Reversible Transformations that Improve Text Compression Ratio, Project paper, University of Central Florida, USA, 2000
Philip Gage, "A New Algorithm For Data Compression", The C Users Journal, vol. 12(2)2, pp. 2338, February 1994
P. G. Howard, J. S. Vitter, "Arithmetic coding for data compression", Proc. IEEE. , vol.82: pp.857- 865, 1994
J. C. Kieffer, E. H. Yang, Grammar-based codes: A new class of universal lossless source codes, IEEE Trans. Inform. Theory, vol. 46, pp. 737754, 2000
H. Kruse, A. Mukherjee. Preprocessing Text to Improve Compression Ratios,Proc. Data Compression Conference, pp. 556, 1998
G. Langdon, "An introduction to arithmetic coding", IBM Journal Research and Development, vol. 28, pp. 135-149, 1984
Detlev Marpe, Heiko Schwarz, Thomas Wiegand,

Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard, IEEE Trans. On Circuits and Systems for Video Technology, vol. 13(7), pp. 620-636, July 2003
E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2),

PP.262272, 1976
Altan Mesut, Aydin Carus, ISSDC: Digram Coding Based Lossless Dtaa Compression Algorithm, Computing and Informatics, Vol. 29, pp.741754, 2010
Moffat, Implementing the PPM Data Compression Scheme, IEEE Transactions on Communications, vol.38, pp.1917-1921, 1990
M. Nelson, "Data Compressin with the Burrows- Wheeler Transform", Dr. Dobb's Journal, pp. 46-50, Sept 1996 available at http://marknelson.us/1996/09/01/bwt/
Radescu R., "Lossless Text Compression Using the LIPT Transform", Proceedings of the 7th International Conference Communications 2008 (COMM2008), ISBN 978-606-521-008-0., pp. 59- 62, Bucharest, Romania, 5-7 June 2008
Rexline S.J, Robert L. Dictionary Based Preprocessing Methods in Text Compression – A Survey, International Journal of Wisdom Based Computing, vol. 1(2), August 2011
Senthil S, Robert L, Text Preprocessing using Enhanced Intelligent Dictionary Based Encoding (EIDBE), Proceedings of Third International Conference on Electronics Computer Technology, pp.451-455, Apr 2011
Senthil S, Robert L, "IIDBE: A Lossless Text Transform for Better Compression", International Journal of Wisdom Based Computing, vol. 1(2),

August 2011
Shajeemohan B.S, Govindan V.K, "Compression scheme for faster and secure data transmission over networks", IEEE Proceedings of the International conference on Mobile business, 2005
Storer J. A., Szymanski T. G., "Data Compression via Textual Substitution", Journal of ACM Vol. 29(4), pp. 928-951, Oct 1982
W. Sun, A. Mukherjee, N. Zhang, A Dictionary- based Multi-Corpora Text compression System, Proceedings of the 2003 IEEE Data Compression Conference, March 2003
S. Taubman and M. W. Marcellin, "JPEG2000: Image Compression Fundamentals", Standards and Practice. Norwell, MA: Kluwer Academic, 2002
Ukkonen. On-line construction of suffix tress,

Algorithmica, vol. 14(3), pp. 249260, 1995
T. Wiegand, G. Sullivan, G. Bjontegaard, A. Luthra,

Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits Syst.Video Technol., vol. 13(7), pp. 560576, Jul 2003
P. Weiner, Linear pattern matching algorithms, In Proceedings of the 14th IEEE Annual Symposium on Switching and Automata Theory, The University of Iowa, pp.111, 1973.
T. Welch, A Technique for High-Performance Data Compression, IEEE Computer, vol. 17(6), pp. 8- 19, June 1984
M. J. Willems, Y. M. Shtarkov, T. J. Tjalkens, The context-tree weighting method: Basic properties, IEEE Trans. Inform. Theory, vol.41, pp. 653664,

May 1995
H. Witten, R. M. Neal, J. G. Cleary, Arithmetic coding for data compression, Commun. ACM, vol. 30(6), pp. 520540, 1987
J. Ziv, A. Lempel, "Compression of individual sequences via variable rate coding", IEEE Transactions on Information Theory, IT-24(5), pp.530-536, 1978
J. Ziv, A. Lempel. A Universal Algorithm for Sequential Data Compression, IEEE Trans. Information Theory, IT-23, pp.337-343, 1977

	BWT	Star encoding	LIPT	StarN T	IDBE	EIDBE	IIDBE	Digram encodin g	ISSDC	BPE
Source Type	Any	Text	Text	Text	Text	Text	Text	Any	Any	Any
Dictionary	—	static, 22 sub-dict	static, 22 sub- dict	semi- static, single	semi- static	semi- static	semi- static	semi- static	semi- static	—
Size of token to be encoded	block	word upto 22 letters	word upto 22 letters	Word	Word	word	word	digram	digram	2 bytes
Matching	string	string	string	String	String	string	string	string or integer	string or integer	string or integer
comparison time per token	high	O(Sub- Dict- size)	O(Sub -Dict- size)	O(Dict – size)	O(Dict- size)	O(Dict- size)	O(Dict- size)	O(Dict- size)	O(Dict- size)	single 2- byte compariso n
Code length	For Block	variable length: word- size	variabl e length: <*, word length, index>	variabl e length: index with max. 3- letters	variable length: <1-byte codeword length, codeword >	variable length: <1-byte word length, codeword >	variable length: <1-byte codeword length, codeword >	fixed, depends on dictionar y size	fixed, depends on dictionar y size	1 byte
Redundanc y using	Index	*	index, length	index, length	index, length	index, length	index, length	index	index	substitutio n
Compresssi on methods that can be applied later	MTF, RLE and then Huffman or arithmeti c coding	RLE, LZW, Huffman, Arithmet ic coding	Huffman or Arithmetic Coding		Pre-processing to BWT, Later MTF and RLE and entropy encoding			Huffman or Arithmetic coding

Drawback	needs	only for text source	benefits	repetitiv	repetitiv,
	better		only	e,	benefits
	data		with	benefits	only
	structure		small-	only	when
	s for		alphabet	with	source
	sorting,		source	small	have
	comparin			size	some
	g			source	unused
				file and	symbols,
				small	i.e. for
				alphabet	small
				source	alphabet
					source

Volume 02, Issue 12 (December 2013)

Quad-Byte Transformation as a Pre-Processing to Arithmetic Coding

Quad-Byte Transformation as a Pre-Processing to Arithmetic Coding

BWT

StarTransform

Length Index Preserving Transform (LIPT)

Star New Transformation (StarNT)

Intelligent Dictionary Based Encoding (IDBE)

Digram Coding

Byte-Pair Encoding

Leave a Reply