# Recursive Bayesian Estimation based on Stochastic Context-free Grammar

01 Mar 2014

## Time series analysis

Many machine learning and statistic problem can be generalized as an approach to estimate one or series unknown probability distributions based on some known facts. In a typical time series scenario, the observed data have natural temporal order which the data will become available one by one through time. In order to take advantage of such ordering, the ideal estimation should performed recursively following the time. The model as well as the embedded parameters will be updated incrementally through time. And most important applications include filtering, predication, and smoothing can be derived under this model.

Mathematically, unless the underlying time series is constant, people used to assume it as a stochastic process. Many statistic models are proposed to try better characterize the process. The most basic model only consider the level values. Typically such model assume the current sample have linear dependence with previous sample plus some white noise. Three broad classes of practical importance are the autoregressive (AR) models, the integrated (I) models, and the moving average (MA) models. Based on those models, there also have some combination models commonly used in practical applications include ARMA, ARIMA etc.

Beyond the linear relationship, there are models to represent the change of variance instead of values over time. For example, the volatility clustering in financial time series tend to have heavy dependency among volatility "large changes tend to be followed by large changes, of either sign, and small changes tend to be followed by small changes". In order to capture and quantify the additional variances correlation over time (heteroskedasticity), researchers propose more advanced autoregressive conditional heteroskedasticity (ARCH) model.

Inspired from heteroskedasticity model, instead of modeling the observations directly, the more advanced model starts to capture the indirect statistics. Such changing variance, though hidden from our observations, could also be modeled by some stochastic state process. By allowing more hidden control variables as the system state, state space model can be used to generalized such assumptions. The most common state space model on practical usage is hidden Markov model which assume the unobserved (hidden) states is a Markov process.

## Recursive Bayesian estimation

In most statistical models, the parameterized states to be modeled are posterior probability density functions conditioned on observations. Specifically in time series, our mission is seeking a solution to estimate such pdf in recursive manner. This optimal solution is typically intractable unless in linear assumption. However, in most cases, it have optimal substructure over the time index. Therefore, it is possible to construct an efficient numerical solution through dynamical programming.

Once we go towards to state space model, we already assume their are two layers in the system. One is the observations $$\mathbf{O}$$ which are input variables that we can directly observed from outside. And the other one is the hidden states $$\mathbf{X}$$ which are assumed generate $$\mathbf{O}$$ in the backyard. Therefore, once we have $$\mathbf{O}$$ in hand, we could based on that to infer who is the hidden state $$\mathbf{X}$$ generate such $$\mathbf{O}$$ probabilistically.

Notice that both $$\mathbf{O}$$ and $$\mathbf{X}$$ are time indexed process with sort of dependency on series. So the inference of $$X_i$$ is not only depend on $$O_i$$, but also depend on the past $$X_{1:i-1}$$ and $$O_{1:i-1}$$. Typically we assume the dependency between $$X_i$$ and $$O_{1:i-1}$$ are go through $$X_{1:i-1}$$. Hence $$\mathbf{X}$$ are conditional independent between each other. Another assumption is $$\mathbf{X}$$ have Markov property, which leads to $\mathbf{P}(X_k | X_{1:k-1}) = \mathbf{P}(X_k | X_{k-1})$

In above model, our goal is to recursively estimate $$\mathbf{X}$$ given $$\mathbf{O}$$ with the time $$t$$ from $$1$$ to $$n$$.

So let's start from scratch. Initially, when $$t = 0$$, we have no observations. Therefore, the only thing we could do is guess the next hidden state based on the prior knowledge $\mathbf{P}(X_1) = \pi$ Then, $$t$$ move on to $$1$$, we observe our first observation $$O_1$$. The Bayesian posterior probability by adding this new information is, $\mathbf{P}(X_1 | O_1) = \frac{\mathbf{P}(O_1 | X_1) \mathbf{P}(X_1)}{ \sum_{x} \mathbf{P}(O_1|x)\mathbf{P} (x)}$ following that, we are also curious a little bit more about the next state based on the newly updated observations. Because of Markov property, the probability of next state $$X_2$$ only depends on $$X_1$$. $\mathbf{P}(X_2 | O_1) = \sum_{x_1} \mathbf{P}(X_2 | x_1) \mathbf{P}(x_1 | O_1)$

We could follow above computation iteratively through time. Essentially, for each time step, there are two steps involve in the iteration,

• Predict prior probability $\mathbf{P} (X_{k+1} | O_{1:k}) = \sum_{x_k} \mathbf{P}(X_{k+1} | x_k) \mathbf{P}(x_k | O_{1:k})$
• Update posterior probability $\mathbf{P} (X_{k+1} | O_{1:k+1}) = \frac{\mathbf{P}(O_{k+1} | X_{k+1}) \mathbf{P}(X_{k+1} | O_{1:k})}{ \sum_{x} \mathbf{P}(O_{k+1}|x)\mathbf{P} (x | O_{1:k})}$

Under different restrictions, the above estimation commonly have two filtering method, Kalman filter and Particle filter.

• Kalman filter. When the process $$\mathbf{X}$$ is Gaussian linear process, by taking advantage of Gaussian property, both prior and posterior calculations have Gaussian close form.
• Particle filter When the process $$\mathbf{X}$$ is not Gaussian linear process, the posterior probability no longer have close form to calculate. Monte Carlo simulation (Sequential importance re-sampling) will be used to estimate the distribution.

## Markov assumption

In predict mode, the next state is predicted by marginalizing the previous belief through Chapman-Kolmogorov equation. In update mode, the current state is updated by balancing the likelihood and predict probability through Bayesian equation.

It is straightforward to update the posterior belief through Bayesian estimator once the prior probability is available. Most current research are concentrating on how to well capture the prior state probability under different assumptions. Generally, the hidden states are assumed have Markov property. Therefore, only one latest state belief is used to marginalized the current one.

• If the relationship between those two states is discrete, then the distribution can be calculated through finite grid based estimation.
• If the relationship is linear Gaussian, then the distribution can be calculated in close form through Kalman filter.
• If the relationship is nonlinear, then the Monte Carlo method is required to estimate such relationship as well as the distribution.

The above three scenarios are well studied, but notice that they all share one strong assumption -Markov property which require that current state can be only determined by the previous one.

Specifically, Markov property characterize a left-most sample derivation (generation) method. From grammar point of view, the Markov model is equivalent to a regular grammar.

Any observations whom generated only depends on the most previous one is good fit for Markov as well as the above recursive Bayesian models. However, there are observations have long term relationship, pair generation and structure co-occurrence. Under these circumstances, instead of using regular grammar, people use context-free model to well characterize the desired relationship.

## Stochastic context-free grammar

Now, we introduce stochastic context-free grammar (SCFG) to model the above real vector emission state space system. The SCFG is define as $$\mathbf{G} = (\mathbf{N},\mathbf{\Sigma} , \mathbf{R}, S , p)$$ where:

• $$\mathbf{N}$$ is a finite set of non-terminal symbols.
• $$\mathbf{\Sigma}$$ is a finite set of terminal Gaussian mixtures.
• $$\mathbf{R}$$ is a finite set of the rules of the form $$X \to Y_1 Y_2 \ldots Y_n$$, where $$X \in \mathbf{N}$$ and $$Y \in (\mathbf{N} \cup \mathbf{\Sigma})$$ for $$i = 1 \ldots n$$.
• $$S \in \mathbf{N}$$ is the start non-terminal.
• $$p$$ is the conditional probability measurement for each rules in $$(\alpha \to \beta) \in \mathbf{R}$$.

Additionally we have following constraints for each nonterminal $$X \in \mathbf{N}$$, $\sum_{\alpha \to \beta \; \alpha = X} p(\alpha \to \beta) = 1$

The SCFG is a generative data model. The sequential observation can be assumed to be generated by doing left-most derivation through $$\mathbf{G}$$. The entire derivation sequence can be further elaborate as a parsing tree. And the probability associated on any parsing tree can be calculated by multiply the entire context-free rules that it contains. $\mathbf{P} (S \overset{\ast}{\to} O)$

Typically, there exist multiple parsing trees for $$\mathbf{G}$$ to generate different terminal sequence. Meanwhile, for given $$\mathbf{O}$$, there also exist different parsing trees to understand the sequence.

Initially, SCFG model is inspired from the field of natural language processing to model human or computer languages through grammar system. It also widely adopted in genetic analysis since DNA/RNA sequence naturally have complicated structure patterns and long-distance pairwise correlations.

In traditional SCFG, the terminal symbols are discrete words, token or nucleobase etc. which all can been matched in input sequence deterministically. On the other hand, the terminal symbols in our proposed model are mixtures against to the real vector input observations. In order to distinguish the state of the non-terminal, it requires probabilistic inference. Therefore, our proposed model is an extension of traditional SCFG with hidden non-terminals and real vector emissions.

### Model evaluation

Evaluating the SCFG for given observation is equivalent to parse $$\mathbf{O}$$ based on $$\mathbf{G}$$.

We define inside probability $$\gamma(i,j,v)$$ measure the likelihood of the entire possible parsing trees rooted by non-terminal $$v$$ given $$O_{i:j}$$. $\gamma(i,j,v) = \sum_{t \in T(v)} \mathbf{P} (v \overset{\ast}{\to} O_{i:j})$

Notice that $$\gamma(1,n,S)$$ is exactly the probability of the grammar $$\mathbf{G}$$ generate observation $$\mathbf{O}$$.

Though both models are generative model, hidden Markov model assume all the outcome generation and hidden states transition starts from left to right. On the contrary, SCFG relax such restriction and allow the outcome generation as well as the hidden terminals and non-terminals transition start from inside to outside. Obviously, left to right is a special instance for inside to outside when the left part of inside bounded to left-most.

In order to demonstrate the benefit of the inside-outside generation pattern, we take long-distance pairwise $$a^nb^n$$ as an example. The traditional left-right model only capture the Markov property which leads to three transition probabilities $$\mathbf{P}(a\to a) = \frac{n-1}{n}$$, $$\mathbf{P}(a\to b) = \frac{1}{n}$$ and $$\mathbf{P}(b\to b) = 1$$.

On the other hand, the context-free model enable the long-term distance pair generation. Specifically, each pair contains $$a$$ and $$b$$. The generation start from inside to outside which leads to two state transition probabilities (context-free rules) $$p (S \to ab) = \frac{1}{n}$$ and $$p (S \to a\;S\;b) = \frac{n-1}{n}$$.

Both models try to understand the observations under their capability. But in this case, the fact of co-occurrence pair $$ab$$ is more favorable.

### Earley parser

There are several parsing methodologies available for context-free grammar. Most of them are based on a generalization of bottom-up chart parsing, such as the CYK algorithm.

Bottom-up strategy has its own advantage, but it typically require all the input strings available before it starts. However, the top-down fashion do not have

The Earley parser is an algorithm for parsing string that belong to a given context-free language. Generally it keep track of all possible derivations that are consistent with the input string up to the current time. With more and more observations become available from left to right, the set of possible derivation can either expand as new choices are introduced, or shrink as a result of resolved ambiguities.

Traditional Earley parser maintain set of states for each position in the input, describing all pending derivations. These state sets together form the Earley chart. A state is of the form $i:\;_kX \to \lambda \ast \mu$ where $$X$$ is a non-terminal, $$\lambda$$ and $$\mu$$ are either terminals or non-terminals, $$i$$ and $$k$$ are indexes of the input, and $$\ast$$ which called Earley dot refers to the current position $$i$$ corespondent to the rule $$X \to \lambda \mu$$.

Any state in the chart have following semantics,

• The current state position in the inputs is $$i$$, $$O_{1:i-1}$$ have been processed by the parser already. The states describing the parser state at position $$i$$ are collectively called state set $$i$$.

• Non-terminal $$X$$ was expanded starting at position $$k$$ in the input.

• The expansion of $$X$$ processed using the production $$X \to \lambda \mu$$, and has expanded the right-hand side (RHS) $$\lambda \mu$$ to the position indicated by the Earley dot.

Because Earley dot indicate the current processing index in the RHS. When Earley dot sits to the right of the entire RHS, it indicates the left-hand side (LHS) non-terminal has been fully expanded. We called it complete state. On the other hand, when Earley dot does not sit in the right side of the RHS, it indicate the LHS non-terminal haven't fully expanded. We called it incomplete state which shows it may or may not expanded in the future.

Earley parser as well as its probabilistic version iteratively follow three steps,

• Prediction. For each state $i:\;_kX \to \lambda \ast Y \mu,$ where $$Y$$ is a nonterminal anywhere in the RHS, and for all rules $$Y \to \nu$$expanding $$Y$$, add states $i:\;_iY \to \ast \nu$ A state produced by prediction is called a predicted state. Each prediction corresponds to a potential expansion of a nonterminal in a left-most derivation. Because the prediction follows the grammar rule, so it also called top-down prediction.

• Scanning. For each state $i:\;_kX \to \lambda \ast \alpha \mu,$ where $$\alpha$$ is a terminal symbol that matches the current input $$O_i$$, add the state $i+1:\;_kX \to \lambda \alpha \ast \mu,$ A state produced by scanning is called a scanned state. Scanning ensures that the terminals produced in a derivation match the input string.

• Completion. For each state $i:\;_jY \to \nu \ast$ and each state in set $$j$$, $$j \le i$$, that has $$Y$$ to the right of the dot, $j:\;_kX \to \lambda \ast Y \mu,$ add the state, $i:\;_kX \to \lambda Y \ast \mu,$ A state produced by completion is called a completed state. Each completion corresponds to the end of a nonterminal expansion started by a matching prediction step.

For each input symbol and corresponding state set, an Earley parser performs all three operations exhaustively.

### Recursive Earley estimator

Even we assume the state space system is context-free which leads to more generic inside outside generation process. Due to the physical restriction, we may still observer the system outcome followed by time indexed from left to right. For example, human language are structured based on grammar rule, it involve long term correlation even in a simple sentence. But when we hear somebody speaking, the words are still coming from left to right.

In such cases, in order to model the data through SCFG, it requires incremental estimation or parsing on time. The same scenario also happens on hidden Markov model called recursive Bayesian estimation which we mentioned before.

However in typical recursive time series analysis, the estimator do not require the entire observation available at beginning. Instead of using bottom-up parsing strategy, it require a top-down strategy to incrementally parse the incoming observation.

## Reference

1. Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Comput. Linguist. 21, 2 (June 1995), 165-201.