Stochastic chains with memory of variable length
Stochastic chains with memory of variable length are a family of stochastic chains of finite order in a finite alphabet, such as, for every time pass, only one finite suffix of the past, called context, is necessary to predict the next symbol. These models were introduced in the information theory literature by Jorma Rissanen in 1983,[1] as a universal tool to data compression, but recently have been used to model data in different areas such as biology,[2] linguistics[3] and music.[4]
Definition
A stochastic chain with memory of variable length is a stochastic chain Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (X_n)_{n\in Z}} , taking values in a finite alphabet Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} , and characterized by a probabilistic context tree Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\tau,p)} , so that
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \tau} is the group of all contexts. A context Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{n-l},\ldots,X_{n-1}} , being the size of the context, is a finite portion of the past Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{-\infty},\ldots,X_{n-1}} , which is relevant to predict the next symbol ;
- is a family of transition probabilities associated with each context.
History
The class of stochastic chains with memory of variable length was introduced by Jorma Rissanen in the article A universal data compression system.[1] Such class of stochastic chains was popularized in the statistical and probabilistic community by P. Bühlmann and A. J. Wyner in 1999, in the article Variable Length Markov Chains. Named by Bühlmann and Wyner as “variable length Markov chains” (VLMC), these chains are also known as “variable-order Markov models" (VOM), “probabilistic suffix trees”[2] and “context tree models”.[5] The name “stochastic chains with memory of variable length” seems to have been introduced by Galves and Löcherbach, in 2008, in the article of the same name.[6]
Examples
Interrupted light source
Consider a system by a lamp, an observer and a door between both of them. The lamp has two possible states: on, represented by 1, or off, represented by 0. When the lamp is on, the observer may see the light through the door, depending on which state the door is at the time: open, 1, or closed, 0. such states are independent of the original state of the lamp.
Let a Markov chain that represents the state of the lamp, with values in and let be a probability transition matrix. Also, let be a sequence of independent random variables that represents the door's states, also taking values in , independent of the chain and such that
where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 0 < \epsilon < 1} . Define a new sequence Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (Z_n)_{n \ge 0}} such that
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z_n = X_n \xi_n} for every Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (Z_n)_{n \ge 0}.}
In order to determine the last instant that the observer could see the lamp on, i.e. to identify the least instant Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} , with in which Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Z_k=1} .
Using a context tree it's possible to represent the past states of the sequence, showing which are relevant to identify the next state.
The stochastic chain Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (Z_n)_{n\in\mathbb{Z}}} is, then, a chain with memory of variable length, taking values in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} and compatible with the probabilistic context tree Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\tau,p)} , where
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \tau = \{1,10,100,\cdots\} \cup \{0^\infty\}.}
Inferences in chains with variable length
Given a sample Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{l},\ldots,X_{n}} , one can find the appropriated context tree using the following algorithms.
The context algorithm
In the article A Universal Data Compression System,[1] Rissanen introduced a consistent algorithm to estimate the probabilistic context tree that generates the data. This algorithm's function can be summarized in two steps:
- Given the sample produced by a chain with memory of variable length, we start with the maximum tree whose branches are all the candidates to contexts to the sample;
- The branches in this tree are then cut until you obtain the smallest tree that's well adapted to the data. Deciding whether or not shortening the context is done through a given gain function, such as the ratio of the log-likelihood.
Be Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{0},\ldots,X_{n-1}} a sample of a finite probabilistic tree Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\tau,p)} . For any sequence with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle j \leq n} , it is possible to denote by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N_n(x_{-j}^{-1})} the number of occurrences of the sequence in the sample, i.e.,
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N_n(x_{-j}^{-1}) = \sum_{t=0}^{n-j} \mathbf{1}\left\{X_t^{t+j-1} = x_{-j}^{-1}\right\}}
Rissanen first built a context maximum candidate, given by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_{n-K(n)}^{n-1}} , where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K(n)=C\log{n}} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle C} is an arbitrary positive constant. The intuitive reason for the choice of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle C\log{n}} comes from the impossibility of estimating the probabilities of sequence with lengths greater than Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \log{n}} based in a sample of size .
From there, Rissanen shortens the maximum candidate through successive cutting the branches according to a sequence of tests based in statistical likelihood ratio. In a more formal definition, if bANnxk1b0 define the probability estimator of the transition probability Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p} by
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \hat{p}_n(a\mid x_{-k}^{-1}) = \frac{N_n(x_{-k}^{-1}a)}{\sum_{b \in A} N_n (x_{-k}^{-1} b)}}
where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_{-j}^{-1}a=(x_{-j}, \ldots, x_{-1},a)} . If Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_{b \in A}N_n(x_{-k}^{-1}b) \,=\,0} , define Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \hat{p}_n(a\mid x_{-k}^{-1}) \,=\, 1/|A|} .
To Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i \geq 1} , define
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Lambda_n (x_{- i }^{-1} ) \,=\, 2 \, \sum_{y \in A} \sum_{a \in A} N_n(y x_{-i}^{-1}a) \log\left[\frac{\hat{p}_n(a\mid x_{-i}^{-1} y)} {\hat{p}_n(a\mid x_{-i}^{-1})} \right]\,}
where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y x_{-i}^{-1}=(y,x_{-i}, \ldots , x_{-1})} and
Note that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Lambda_n (x_{- i }^{-1})} is the ratio of the log-likelihood to test the consistency of the sample with the probabilistic context tree Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\tau,p)} against the alternative that is consistent with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\tau',p')} , where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \tau} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \tau'} differ only by a set of sibling knots.
The length of the current estimated context is defined by
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \hat{\ell}_n(X_0^{n-1})= \max \left\{i=1,\ldots, K(n): \Lambda_n (X_{n-i}^{n-1}) \,>\, C \log n \right\}\,}
where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle C} is any positive constant. At last, by Rissanen,[1] there's the following result. Given Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X_0,\ldots, X_{n-1}} of a finite probabilistic context tree Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\tau,p)} , then
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P\left( \hat{\ell}_n(X_0^{n-1}) \neq \ell(X_0^{n-1}) \right) \longrightarrow 0,}
when Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n \rightarrow \infty} .
Bayesian information criterion (BIC)
The estimator of the context tree by BIC with a penalty constant Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle c>0} is defined as
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \hat{\tau}_\mathrm{BIC}=\underset{\tau \in \mathcal{T}_n}{\arg \max}\{\log L_\tau (X_1^n)-c\,\textrm{d}f (\tau)\log n \}}
Smallest maximizer criterion (SMC)
The smallest maximizer criterion[3] is calculated by selecting the smallest tree τ of a set of champion trees C such that
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lim_{n\to \infty} \frac{\log L_\tau (X^n_1) - \log L_{\hat{\tau}}(X^n_1)}{n} = 0 }
See also
References
- ^ 1.0 1.1 1.2 1.3 Rissanen, J (Sep 1983). "A Universal Data Compression System". IEEE Transactions on Information Theory. 29 (5): 656–664. doi:10.1109/TIT.1983.1056741.
- ^ 2.0 2.1 Bejenaro, G (2001). "Variations on probabilistic suffix trees: statistical modeling and prediction of protein families". Bioinformatics. 17 (5): 23–43. doi:10.1093/bioinformatics/17.1.23. PMID 11222260.
- ^ 3.0 3.1 Galves A, Galves C, Garcia J, Garcia NL, Leonardi F (2012). "Context tree selection and linguistic rhythm retrieval from written texts". The Annals of Applied Statistics. 6 (5): 186–209. arXiv:0902.3619. doi:10.1214/11-AOAS511.
- ^ Dubnov S, Assayag G, Lartillot O, Bejenaro G (2003). "Using machine-learning methods for musical style modeling". Computer. 36 (10): 73–80. CiteSeerX 10.1.1.628.4614. doi:10.1109/MC.2003.1236474.
- ^ Galves A, Garivier A, Gassiat E (2012). "Joint estimation of intersecting context tree models". Scandinavian Journal of Statistics. 40 (2): 344–362. arXiv:1102.0673. doi:10.1111/j.1467-9469.2012.00814.x.
- ^ Galves A, Löcherbach E (2008). "Stochastic chains with memory of variable length". TICSP Series. 38: 117–133. arXiv:0804.2050.