Lempel-Ziv Parsing for Sequences of Blocks

Dmitry Kosolobov and Daniel Valenzuela

Resumen

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other ?elementary? letters)? In this paper, we prove that, for any integer ??>1 b > 1 , the number z of phrases in the LZ77 parsing of a string of length n and the number ???? z b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., ??=8 b = 8 in case of bytes) are related as ????=??(????log????) z b = O ( b z log n z ) . The bound holds for both ?overlapping? and ?non-overlapping? versions of LZ77. Further, we establish a tight bound ????=??(????) z b = O ( b z ) for the special case when each phrase in the LZ77 parsing of the string has a ?phrase-aligned? earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods.

Palabras claves

LZ77 - blocks - Lempel-Ziv - SLP - grammar

Acceso

PÁGINAS

pp. 0 - 0

NÚMERO

Volumen: 14 Parte: 12 (2021)

MATERIAS

INGENIERÍA Y CONSTRUCCIÓN CIVIL
TECNOLOGÍA

REVISTAS SIMILARES

IEEE TRANSACTIONS ON INFORMATION THEORY

DOI