Сжатие с помощью грамматических моделей

>> Русские материалы | Английские материалы

Смотрите также материалы:
- Предсказание по частичному совпадению (PPM)
- Контекстные методы сжатия (без PPM)
- Сжатие текстов
- Моделирование естественных языков
- Классификация и разметка текстов с использованием методов сжатия данных
- Обзоры универсальных алгоритмов сжатия данных
- Книга "Методы сжатия данных". Раздел 1 "Методы сжатия без потерь"



>> Русские материалы | Английские материалы
Авторы Название Описание Рейтинг
Курапова Е.В., Рябко Б.Я. Применение формальных грамматик при кодировании источников информации Кодирование файлов баз данных и библиотек программ с помощью модели источника, соответствующей заданной контекстно-свободной грамматике. Текущее грамматическое состояние обуславливает использование соответствующего ему кода, т.е. применяется контекстное моделирование в чистом виде.
Проблемы передачи информации, том 31, вып.1, 1995. С28-32.
PDF  344 кбайт
4


>> Русские материалы | Английские материалы
Lehman E. Approximation Algorithms for Grammar-Based Data Compression This thesis considers the smallest grammar problem: find the smallest context-free grammar that generates exactly one given string. We show that this problem is intractable, and so our objective is to find approximation algorithms. This simple question is connected to many areas of research. Most importantly, there is a link to data compression; instead of storing a long string, one can store a small grammar that generates it. A small grammar for a string also naturally brings out underlying patterns, a fact that is useful, for example, in DNA analysis. Moreover, the size of the smallest context-free grammar generating a string can be regarded as a computable relaxation of Kolmogorov complexity... In this thesis, we establish hardness results, evaluate several previously proposed algorithms, and then present new procedures with much stronger approximation guarantees.
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, February 2002.
PDF  682 кбайт
PS.RAR    334 кбайт
?
Bach J., Witten I.H. Lexical Attraction for Text Compression Некоторые наблюдения о зависимости слов в тексте от несмежных с ними слов. Любопытно, но до практичной реализации очень далеко.
New methods of acquiring structural information in text documents may support better compression by identifying an appropriate prediction context for each symbol. The method of "lexical attraction" infers syntactic dependency structures from statistical analysis of large corpora. We describe the generation of a lexical attraction model, discuss its application to text compression, and explore its potential to outperform fixed-context models such as word-level PPM. Perhaps the most exciting aspect of this work is the prospect of using compression as a metric for structure discovery in text.
Proceedings of the 1999 IEEE Data Compression Conference, Snowbird, Utah, March 1999.
HTML  54 кбайт
PDF  69 кбайт
PS.RAR    50 кбайт
4
Charikar M., Lehman E., Liu D., Panigrahy R., Prabhakaran M., Rasala A., Sahai A., Shelat A. Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar is known to be hard to approximate to within a constant factor... Our main result is an exponential improvement of this ratio; we give an O(log(n/g*)) approximation algorithm, where g* is the size of the smallest grammar...
February 20, 2002
PDF  166 кбайт
?
Bookstein A., Klein Sh. Compression, Information Theory and Grammars: A Unified Approach ...We here propose the notion of a formal grammar as a flexible model of text generation that encompasses most of the models offered before as well as, in principle, extending the possibility of compression to a much more general class of languages. Assuming a general model of text generation, a derivation is given of the well known Shannon entropy formula, making possible a theory of information based upon text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure...
Center for Information and Language Studies, University of Chicago, January 1990.
PDF  297 кбайт
PS.RAR    79 кбайт
5
Hutchens J.L. The UpWrite Predictor: A General Grammatical Inference Engine for Symbolic Time Series with Applications in Natural Language Acquisition and Data Compression This dissertion is concerned with the development of a general Grammatical Inference Engine for symbolic time series -- a device which is capable of automatically constructing a predictive model from arbitrary symbolic data. We are particularly interested in the situation where the data is natural language text... The UpWrite Predictor may potentially be of use in many applications, and we explore one of these, data compression, in this dissertation. We show that the performance of standard PPMC data compressor may be improved by using the UpWrite Predictor to abstract the data prior to compression taking place. During our exploration of the data compression problem, we introduce several modifications and additions to traditional PPM technique...
The University of Western Australia, December 1999.
PDF  1432 кбайт
PS.RAR  417 кбайт
?
Nevill-Manning C. Inferring Sequential Structure Диссертация по SEQUITUR -- алгоритму автоматического адаптивного построения контекстно-свободных грамматик.
Structure exists in sequences ranging from human language and music to the genetic information encoded in our DNA. This thesis shows how that structure can be discovered automatically and made explicit... We develop simple, robust techniques that reveal interesting structure in a wide range of real-world sequences including music, English text, descriptions of plants, graphical figures, DNA sequences, word classes in language, a genealogical database, programming languages, execution traces, and diverse sequences from a data compression corpus. These techniques are useful: they produce comprehensible visual explanations of structure, identify morphological units and significant phrases, compress data, optimise graphical rendering, infer recursive grammars by recognising self-similarity, and infer programs.
University of Waikato, May 1996.
PDF  787 кбайт
5
Nevill-Manning C., Witten I., Maulsby D. Compression by Induction of Hierarchical Grammars Использование адаптивно строящихся контекстно-свободных грамматик для сжатия данных. Описываемый алгоритм позднее превратился в алгоритм SEQUITUR.
Computer Science, University of Waikato, Hamilton, New Zealand, 1994.
PDF  44 кбайт
5
Nevill-Manning C., Witten I. Olsen D. Compressing semi-structured text using hierarchical phrase identification И еще статья про SEQUITUR.
...This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of “semi-structured,” we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem...
Computer Science, University of Waikato, Hamilton, New Zealand, 1996.
PDF  53 кбайт
?
Nevill-Manning C., Witten I. Online and offline heuristics for inferring hierarchiies of repetitions in sequences И еще одна лебединая песня про SEQUITUR.
Hierarchical dictionary-based compression schemes form a grammar for a text be replacing each repeated string with a production rule... In this paper, we compare the online method with three offline heuristics for selecting the next substring to replace: longest string first, most common string first, and the string that minimizes the size of the input... We evaluate each technique on artificial and natural sequences. In general, the locally-most-compressive heuristic performs best, followed by most frequent, the online technique, and, lagging by some distance, the longest-first technique.
2000
PDF  584 кбайт
?
Evans W. Compression via Guided Parsing Не особо удачные эксперименты по сжатию исходников с помощью разбиения на потоки лексем в соответствии с правилами контекстно-свободной грамматики языка. Автор исследовал сжатие с потерями побочной информации (форматирование, комментарии).
1997
PDF  183 кбайт
PS.RAR  66 кбайт
3
Kieffer J., Yang E. Grammar Based Codes: A New Class of Universal Lossless Source Codes We investigate a type of lossless source code called a grammar based code, which, in response to any input data string x over a fixed finite alphabet, selects a context-free grammar Gx representing x in the sense that x is the unique string belonging to the language generated by Gx. Lossless compression of x takes place indirectly via compression of the production rules of the grammar Gx. It is shown that, subject to some mild restrictions, a grammar based code is a universal code with respect to the family of finite state information sources over the finite alphabet. Redundancy bounds for grammar based codes are established. Reduction rules for designing grammar based codes are presented.
2000
PDF  358 кбайт
PS.RAR  98 кбайт
3
Lake M. Prediction by Grammatical Match Довольное туманное описание авторского алгоритма PGM (Prediction by Grammatical Match), позволяющего объединить парсер на основе заданных левосторонних грамматик с PPM. В итоге для малых порядков модели получается обогнать PPM по сжатию и используемой памяти.
Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, March 28-30, 2000.
PDF.RAR  135 кбайт
3

Смотрите также материалы:
- Предсказание по частичному совпадению (PPM)
- Контекстные методы сжатия (без PPM)
- Сжатие текстов
- Моделирование естественных языков
- Классификация и разметка текстов с использованием методов сжатия данных
- Обзоры универсальных алгоритмов сжатия данных
- Книга "Методы сжатия данных". Раздел 1 "Методы сжатия без потерь"


наверх