Multi-Stream Word-Based Compression Algorithm

Authors: Ozturk, Emir; Mesut, Altan; Diri, Banu
Record date: 2024-06-12
Year: 2017
ISBN: 978-1-5386-0930-9
URI: https://hdl.handle.net/20.500.14551/20337
Conference: 2017 International Conference on Computer Science and Engineering (UBMK), October 5-8, 2017, Antalya, Turkey

Abstract: In this article, we present a novel word-based lossless compression algorithm for text files that uses a semi-static model. We call it the Multi-stream Word-based Compression Algorithm (MWCA) because it stores the compressed forms of the words in three separate streams, depending on their frequencies in the text. It also stores two dictionaries and a bit vector as side information. In our experiments, MWCA obtains a compression ratio of about 3.23 bpc on average and 2.88 bpc on files larger than 50 MB. If a variable-length encoder such as Huffman coding is applied after MWCA, these ratios drop to 2.63 and 2.44 bpc, respectively. Thanks to its multi-stream structure, MWCA could be a good solution, especially for storing and searching large text data.

Language: tr
Access rights: info:eu-repo/semantics/closedAccess
Keywords: Data Compression; Text Compression; Natural-Language Text
Type: Conference Object
Pages: 34-37
WOS: WOS:000426856900007
Scopus: 2-s2.0-85040605764