Monday, 20 October 2014

Model implementation technical details

Introduction


In this post I will present the details of the resampling filter introduced in my previous post. More specifically, I will describe the collapsed Gibbs sampling algorithm (Algorithm 3) from Radford M. Neal's paper. This is the basic algorithm that was used in the previous post to augment the resampling process of the FastSLAM algorithm. In Neal's paper the information is presented in a neat but very abstract manner, so it is difficult for an uninitiated reader to follow all the details. Therefore, this post will also borrow elements from this paper by Herman Kamper as well as this one by Xiaodong Yu. In the rest of the post I will first describe the algorithm as presented by Neal and then discuss its details more explicitly. Finally, I will show how these are applied to the aforementioned filter. Again, some background on DPMMs and MCMC methods is assumed.

Initial description of the collapsed CRP algorithm.

The description of the algorithm taken as is from Neal's paper is the following:
Finally, in a conjugate context, we can often integrate analytically over the \(\begin{equation} \phi_{c} \end{equation}\), eliminating them from the algorithm. The state of the Markov chain then consists only of the \(\begin{equation} c_{i} \end{equation}\), which we update by Gibbs sampling using the following conditional probabilities.

\begin{cases} P(c_i=c \mid c_{-i},y_i)= b\,\frac{n_{-i,c}}{n-1+\alpha}\int F(y_i,\phi)\,dH_{-i,c}(\phi) & \text{if } c=c_j \text{ for some } j\neq i \\ P(c_i\neq c_j \text{ for all } j\neq i \mid c_{-i},y_i)= b\,\frac{\alpha}{n-1+\alpha}\int F(y_i,\phi)\,dG_0(\phi) \end{cases}

Here, \(\begin{equation} H_{-i,c} \end{equation}\) is the posterior distribution of \(\begin{equation} \phi \end{equation}\) based on the prior \(\begin{equation} G_0 \end{equation}\) and all observations \(\begin{equation} y_j \end{equation}\) for which \(\begin{equation} j \neq i \end{equation}\) and \(\begin{equation} c_j =c \end{equation}\).
The third Gibbs sampling method can be summarized as follows:
Algorithm 3: Collapsed Gibbs sampling

Let the state of the Markov chain consist of \(\begin{equation} c_1,...,c_n \end{equation}\). Repeatedly sample as follows:
  • For \(\begin{equation} i=1,\ldots,n \end{equation}\): draw a new value for \(\begin{equation} c_i \end{equation}\) from \(\begin{equation} c_i \mid c_{-i},y_i \end{equation}\) as defined by the previous equation.


Yes, that's all, no joking here.

More context.

Let's take a step back and revisit our goal. The problem is: given a dataset that contains positional information of particles, cluster the datapoints as a mixture of distributions. The number of distributions the data come from is not known beforehand, so the problem is really to fit the data to an arbitrary, potentially infinite, number of distributions. To do that, we will model the dataset with a Dirichlet process mixture model and perform inference over it using Gibbs sampling.
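Written out, this is the standard DPMM generative model; here \(\begin{equation} a_0 \end{equation}\) is the concentration parameter that also appears in the algorithm below, \(\begin{equation} G_0 \end{equation}\) is the base distribution and \(\begin{equation} F \end{equation}\) is the within-cluster (Gaussian) distribution:

\begin{equation} G \sim DP(a_0, G_0), \qquad \theta_i \mid G \sim G, \qquad x_i \mid \theta_i \sim F(\theta_i) \end{equation}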
Dirichlet process mixture model
In the figure the structure of the model is shown; additional information such as distribution hyperparameters is abstracted away. The main idea is that the Dirichlet process, given a base distribution \(\begin{equation} G_0 \end{equation}\), is used to pick a cluster for each data point on every iteration. Within each cluster the data are further distributed according to a Gaussian distribution. In many cases the base distribution \(\begin{equation} G_0 \end{equation}\) is chosen to be a conjugate prior to the data distribution, as the probabilities can then be calculated in closed form. A more explicit description of the algorithm is the following:

Algorithm 3: Collapsed Gibbs sampling extended

  • If we are on the first iteration of the algorithm, randomly initialize the data to clusters and update their respective parameters.
  • Given \(\begin{equation} a_0^{(t-1)}, \{\theta_k^{(t-1)}\} \end{equation}\) and \(\begin{equation} \{z_i^{(t-1)}\} \end{equation}\) from the previous iteration, sample a new set of \(\begin{equation} \{\theta_k^{(t)}\}_{k=1}^K \end{equation}\) and \(\begin{equation} \{z_i^{(t)}\}_{i=1}^n \end{equation}\) as follows:
    1. Set \(\begin{equation} z=z^{(t-1)}, a_0=a_0^{(t-1)} \end{equation}\)
    2. For i=1,...,n
      1. Remove data item \(\begin{equation} x_i \end{equation}\) from cluster \(\begin{equation} z_i \end{equation}\), since we are going to sample a new \(\begin{equation} z_i \end{equation}\) for \(\begin{equation} x_i \end{equation}\).
      2. If \(\begin{equation} x_i \end{equation}\) is the only data item in its current cluster, that cluster becomes empty after step 2.1. The cluster is then removed, together with its parameter, and K is decreased by 1.
      3. Re-arrange the cluster indices so that 1,...,K are active (non-empty).
      4. Draw a new sample for \(\begin{equation} z_i \end{equation}\) from the following probabilities: \begin{cases} P(z_i=k \mid z_{-i}, x_i) \propto \frac{n_{-i,k}}{n-1+a_0}F(x_i,\theta_k^{(t-1)}),\ \ k \leq K,\ \text{ with } n_{-i,k}= \sum_{j \neq i}\delta(z_j-k) \\ P(z_i = K+1 \mid z_{-i}, x_i) \propto \frac{a_0}{n-1+a_0}\int F(x_i,\theta)\,dG_0(\theta)\end{cases}
      5. If \(\begin{equation} z_i= K+1 \end{equation}\) we get a new cluster. Index this cluster K+1, sample a new cluster parameter \(\begin{equation} \phi_i \end{equation}\) from \(\begin{equation} H(\phi_i|x_i) \end{equation}\), assign it to \(\begin{equation} \theta_{K+1} \end{equation}\), and increase K by 1.
    3. For k=1,...,K
      1. Sample the parameter of each cluster, \(\begin{equation} \theta_k \end{equation}\), from the following distribution: \(\begin{equation} \theta_k^{(t)} \sim p(\theta_k \mid \cdot) \propto G_0(\theta_k|\lambda)L(x_k^{(t)}|\theta_k) \end{equation}\)
    4. Set \(\begin{equation} z^{(t)} = z \end{equation}\)
    5. If \(\begin{equation} a_0 \sim \mathrm{Gamma}(a,b) \end{equation}\), sample \(\begin{equation} a_0^{(t)} \sim p(a_0|K,n,a,b) \end{equation}\) via the auxiliary variable method (a short sketch of this update follows right after this list).
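Step 5 is typically carried out with the auxiliary-variable scheme of Escobar and West (1995). The snippet below is a minimal Python sketch of that update; the function name, the numpy usage and the shape/rate convention for the Gamma(a, b) prior are my own choices, not taken from the papers linked above.

import numpy as np

def sample_alpha(alpha, K, n, a, b, rng=None):
    # One auxiliary-variable update of the DP concentration parameter a_0,
    # assuming a Gamma(a, b) prior with shape a and rate b.
    rng = np.random.default_rng() if rng is None else rng
    # 1. Sample the auxiliary variable eta | alpha, n ~ Beta(alpha + 1, n).
    eta = rng.beta(alpha + 1.0, n)
    # 2. Mixing weight: pi / (1 - pi) = (a + K - 1) / (n * (b - log(eta))).
    odds = (a + K - 1.0) / (n * (b - np.log(eta)))
    pi_eta = odds / (1.0 + odds)
    # 3. Draw alpha from a two-component mixture of Gammas
    #    (numpy's gamma takes a scale parameter, i.e. 1 / rate).
    shape = a + K if rng.random() < pi_eta else a + K - 1.0
    return rng.gamma(shape, 1.0 / (b - np.log(eta)))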

This is a more explicit description of the algorithm and is easier to follow. Still, a bit more context is needed. \(\begin{equation} H(\phi_i|x_i) \end{equation}\) is defined as
\(\begin{equation} dH(\phi_i|x_i) = \frac{F(x_i,\phi_i)\,dG_0(\phi_i)}{\int_{\theta}F(x_i,\theta)\,dG_0(\theta)} \end{equation}\) where \(\begin{equation} G_0 \end{equation}\) is the base distribution and F is the distribution with which the data are distributed within their cluster. When the prior distribution \(\begin{equation} G_0 \end{equation}\) is conjugate to F, the integrals can be calculated in closed form. \(\begin{equation} \theta \end{equation}\) represents the cluster parameters, which are integrated out so that only the sufficient statistics of each cluster need to be tracked.
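For the Gaussian/normal-inverse-Wishart pair used in the rest of this post, this closed form is a well-known one (see, for example, Murphy's note on conjugate Bayesian analysis of the Gaussian distribution): the integral of the likelihood against the cluster posterior \(\begin{equation} H_{-i,c} \end{equation}\) is a multivariate Student-t predictive density. With cluster posterior hyperparameters \(\begin{equation} \mu_n, \kappa_n, \nu_n, S_n \end{equation}\) (computed from the other points assigned to the cluster) and data of dimension D,

\begin{equation} \int F(x_i,\theta)\,dH_{-i,c}(\theta) = t_{\nu_n-D+1}\!\left(x_i \mid \mu_n,\ \frac{\kappa_n+1}{\kappa_n(\nu_n-D+1)}S_n\right) \end{equation}

while the new-cluster integral against \(\begin{equation} G_0 \end{equation}\) is the same expression with the prior hyperparameters \(\begin{equation} \mu_0, \kappa_0, \nu_0, S_0 \end{equation}\) in place of the posterior ones. The appendix at the end of the post works this quantity out for a concrete choice of hyperparameters.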

An example in a nutshell

So, as the title states, I will try to glue theory with application using a pseudocode-based example. The base distribution is chosen to be a normal-inverse-Wishart, and the data within each cluster are Gaussian distributed. The normal-inverse-Wishart was chosen because it is a conjugate prior to the multivariate Gaussian distribution. A good place to start regarding distribution conjugacy is, of course, Wikipedia.
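For reference, the normal-inverse-Wishart prior on a cluster's mean and covariance \(\begin{equation} (\mu_c, \Sigma_c) \end{equation}\) factorizes into a Gaussian prior on the mean (scaled by the covariance) and an inverse-Wishart prior on the covariance, with the data then Gaussian within the cluster:

\begin{equation} NIW(\mu_c, \Sigma_c \mid \mu, \kappa, \lambda, \nu) = N\!\left(\mu_c \mid \mu, \tfrac{1}{\kappa}\Sigma_c\right) IW(\Sigma_c \mid \lambda, \nu), \qquad x \mid \mu_c, \Sigma_c \sim N(\mu_c, \Sigma_c) \end{equation}

The four hyperparameters \(\begin{equation} (\mu, \kappa, \lambda, \nu) \end{equation}\) are exactly the ones that show up in the pseudocode below.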

Algorithm 3: Collapsed Gibbs sampling, an example using pseudocode

Gibbs(points, G0, alpha)
#Gibbs runs the collapsed Gibbs sampler on the data.
#The function stops when an arbitrary convergence criterion is reached.
    sufficientStatisticsArray = [μ,ν,κ,λ][sizeOfPoints]
#We can have at most sizeOfPoints clusters, where sizeOfPoints is the size of the dataset.
    numberOfItemsInCluster = zeros(1, sizeOfPoints)
    clusterUsed = zeros(sizeOfPoints, 1)
    for all points: p
        clusterUsed(p) = randomDataInitialization(sizeOfPoints)
        increaseNumberOfItemsInCluster(clusterUsed(p))
        sufficientStatisticsArray(clusterUsed(p)) = update(p, sufficientStatisticsArray(clusterUsed(p)), '+')
#If the cluster was empty before, the statistics of G0 are used; otherwise the statistics of the
#cluster are updated with the new value.
    endfor

    while(convergenceCriterion == false)
        for all points: p
            reduceNumberOfItemsInCluster(clusterUsed(p))
            sufficientStatisticsArray(clusterUsed(p)) = update(p, sufficientStatisticsArray(clusterUsed(p)), '-')
            newClusterIndex = sampleClusterWithProb(numberOfItemsInCluster, alpha, p, G0, sufficientStatisticsArray)
            increaseNumberOfItemsInCluster(newClusterIndex)
            sufficientStatisticsArray(newClusterIndex) = update(p, sufficientStatisticsArray(newClusterIndex), '+')
            clusterUsed(p) = newClusterIndex
        endfor
        convergenceCriterion = checkConvergence()
    endwhile
    return clusterUsed
endGibbs

update(p, κ, λ, μ, ν, action)
#action is either '+' (add the single point p) or '-' (remove it)
    κ1 = κ action 1
    ν1 = ν action 1
    μ1 = (κ*μ action p)/κ1
    if action is '+'
        λ1 = λ + κ/κ1*(p-μ)*(p-μ)'
    else
        λ1 = λ - κ1/κ*(p-μ1)*(p-μ1)'
    return κ1,ν1,μ1,λ1
endUpdate
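As a concrete counterpart to update, here is a minimal Python sketch of adding or removing a single point from a cluster's normal-inverse-Wishart statistics; the function names are mine, and the formulas are the single-point case of the standard conjugate updates referenced above.

import numpy as np

def niw_add(p, mu, kappa, nu, lam):
    # Add one data point p to the sufficient statistics (mu, kappa, nu, lam).
    p = np.asarray(p, dtype=float)
    d = p - mu
    kappa1, nu1 = kappa + 1.0, nu + 1.0
    mu1 = (kappa * mu + p) / kappa1
    lam1 = lam + (kappa / kappa1) * np.outer(d, d)
    return mu1, kappa1, nu1, lam1

def niw_remove(p, mu, kappa, nu, lam):
    # Undo niw_add: recover the statistics the cluster had before p was added.
    p = np.asarray(p, dtype=float)
    kappa0, nu0 = kappa - 1.0, nu - 1.0
    mu0 = (kappa * mu - p) / kappa0
    d = p - mu0
    lam0 = lam - (kappa0 / kappa) * np.outer(d, d)
    return mu0, kappa0, nu0, lam0

Adding and then removing the same point returns the original statistics, which is exactly what the '-' branch of update is used for in the main loop.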

sampleClusterWithProb(itemsInClusters, alpha, datapoint, G0, sufficientStatisticsArray)
    for all active clusters: cluster
        n(cluster) = itemsInClusters(cluster) * getMarginalProbForCluster(datapoint, sufficientStatisticsArray(cluster))
    endfor
    n_new = alpha * getMarginalProbForCluster(datapoint, G0)
    const   = sum(n) + n_new
    randVar = random(1)
    if (randVar < n_new/const)    k = index of a new (empty) cluster
    else    k = an existing cluster index, sampled with probability proportional to n(cluster)
    return k
endsampleClusterWithProb
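Rather than only distinguishing "new cluster" from "some existing cluster", an implementation usually draws the index directly from all of the unnormalized weights at once. A small numpy sketch (names mine); in practice the weights are computed in log space and exponentiated after subtracting their maximum to avoid underflow:

import numpy as np

def sample_index(weights, rng=None):
    # weights: unnormalized, non-negative CRP-times-marginal weights;
    # the last entry plays the role of the new-cluster weight alpha * m_new.
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    return rng.choice(len(w), p=w / w.sum())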

getMarginalProbForCluster(datapoint, κ,λ,μ,ν)
    marginalForDataGivenSufficientStatistics =  \(\begin{equation} \pi^{-ND/2}\frac{\kappa_0^{D/2}|S_0|^{\nu_0/2}}{\kappa_n^{D/2}|S_n|^{\nu_n/2}}\prod_{i=1}^D\frac{\Gamma(\frac{\nu_n+1-i}{2})}{\Gamma(\frac{\nu_0+1-i}{2})}  \end{equation}\)
    return marginalForDataGivenSufficientStatistics
endgetMarginalProbForCluster
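A hedged Python sketch of the same quantity, computed as the log of the multivariate Student-t posterior predictive of a single point, which is equivalent (up to constants shared by all clusters) to the ratio of the marginal above with and without that point. The function name is mine and scipy's gammaln is used for numerical stability:

import numpy as np
from scipy.special import gammaln

def log_predictive(x, mu, kappa, nu, lam):
    # log p(x | cluster statistics) for a Gaussian likelihood with a
    # normal-inverse-Wishart prior/posterior given by (mu, kappa, nu, lam).
    x = np.asarray(x, dtype=float)
    D = x.size
    dof = nu - D + 1.0                              # Student-t degrees of freedom
    scale = lam * (kappa + 1.0) / (kappa * dof)     # Student-t scale matrix
    diff = x - mu
    _, logdet = np.linalg.slogdet(scale)
    quad = diff @ np.linalg.solve(scale, diff)
    return (gammaln((dof + D) / 2.0) - gammaln(dof / 2.0)
            - 0.5 * (D * np.log(dof * np.pi) + logdet)
            - 0.5 * (dof + D) * np.log1p(quad / dof))

Passing the prior hyperparameters instead of a cluster's statistics gives the new-cluster term that is weighted by alpha in sampleClusterWithProb.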


The pseudocode in the main body is easy to follow as it is very descriptive, but the helper functions need more explanation. The update formulas can be found in Kamper's paper on page 4. The formula for the marginal is also given in Kamper's paper, for a normal-inverse-Wishart prior with Gaussian-distributed data within each cluster.
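For completeness, the batch form of those updates for N points with sample mean \(\begin{equation} \overline{x} \end{equation}\) and scatter \(\begin{equation} S_{\overline{x}} = \sum_i (x_i-\overline{x})(x_i-\overline{x})^T \end{equation}\) is the standard conjugate one:

\begin{equation} \kappa_n = \kappa_0 + N, \quad \nu_n = \nu_0 + N, \quad \mu_n = \frac{\kappa_0\mu_0 + N\overline{x}}{\kappa_0 + N}, \quad S_n = S_0 + S_{\overline{x}} + \frac{\kappa_0 N}{\kappa_0 + N}(\overline{x}-\mu_0)(\overline{x}-\mu_0)^T \end{equation}

The appendix below uses the N = 1 case of the \(\begin{equation} S_n \end{equation}\) update.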
Since I found the marginal distribution to be the most confusing to follow, I will add some remarks on it:
  • The normal-inverse-Wishart distribution is a four-parameter distribution that serves as a conjugate prior to the multivariate Gaussian distribution. The parameters can be thought of as follows: NIW(μ,κ,λ,ν)
    • μ: the initial mean estimate, our prior on the mean
    • κ: a mean-smoothing parameter, or to put it differently, how strong our belief in the mean is
    • λ: the initial Σ estimate, our prior on the covariance
    • ν: the degrees of freedom, i.e. how strongly we believe in the covariance prior
  • When setting \(\begin{equation} \kappa_0=0, \nu_0=d, ||\Psi||=0 \end{equation}\), the posterior predictive estimate for the cluster reduces to \(\begin{equation} P(x_i|x_{-i},z_i=k,z_{-i},\lambda) \propto N(\mu_{k,-i},\Sigma_{k,-i}) \end{equation}\), as shown in this excellent blog post. Another implementation of this technique in Java, by Vasilis Vryniotis, can be found here.
  • Finally, once the hyperparameters are fixed, the formula can be reduced further: for μ0=[0,0], κ0 = 1, ν0 = 4, λ0 = I(2,2) (the 2x2 identity) the marginal likelihood reduces to \(\begin{equation} \left(1+\frac{(x-\mu_0)^T S^{-1}(x-\mu_0)}{\nu}\right)^{-\frac{\nu+D}{2}} \frac{\Gamma(\frac{\nu+D}{2})}{\Gamma(\frac{\nu}{2})}\,\mathrm{Det}(\nu D S)^{-\frac{1}{2}}\end{equation}\).

Appendix. Derivation of the final formula

We can get the final formula by substituting into the general formula for the marginal of the Gaussian-NIW model that was shown above. The formula then becomes:
\begin{equation} \pi^{-1/2}\frac{\kappa_0^{1/2}|I|^{\nu_0/2}}{\kappa_n^{1/2}|S_n|^{\nu_n/2}}\prod_{i=1}^{1}\frac{\Gamma(\frac{\nu_n+1-i}{2})}{\Gamma(\frac{\nu_0+1-i}{2})}  \end{equation}
Since \(\begin{equation} \kappa_0=1 \end{equation}\) and \(\begin{equation} |I|=1 \end{equation}\), the numerator reduces to one. Furthermore, \(\begin{equation} \pi^{-1/2} \end{equation}\) is the same constant for all values, so it can be disregarded.
The formula is now reduced to:
\begin{equation} \frac{1}{\kappa_n^{1/2}|S_n|^{\nu_n/2}}\prod_{i=1}^{1}\frac{\Gamma(\frac{\nu_n+1-i}{2})}{\Gamma(\frac{\nu_0+1-i}{2})}  \end{equation}

The next step is to calculate the denominator \(\begin{equation} \kappa_n^{1/2}|S_n|^{\nu_n/2} \end{equation}\).
If we now take \(\begin{equation} S_n = S_0\,\frac{\kappa_0+1}{\kappa_0(\nu-p+1)} \end{equation}\), then the expansion of the determinant in the denominator is the following:
\(\begin{equation} | S_0 + S_{\overline{x}}+ \frac{\kappa_0}{\kappa_0+1}(x-\mu_0)(x-\mu_0)^T | \end{equation}\)
The \(\begin{equation} S_{\overline{x}} \end{equation}\) term is the scatter around the empirical mean, and it is zero since we have only one value.
Using the matrix determinant lemma \(\begin{equation} \mathrm{Det}(X+cr^T) = \mathrm{Det}(X)(1 + r^T X^{-1} c) \end{equation}\), we expand the determinant to:

\(\begin{equation} | S_0 + S_{\overline{x}}+ \frac{\kappa_0}{\kappa_0+1}(x-\mu_0)(x-\mu_0)^T | = \mathrm{Det}(S_n)\left(1+ \frac{(x-\mu_0)^T S^{-1}(x-\mu_0)}{\nu}\right) \end{equation}\)
If we expand the determinant term we get:
\(\begin{equation} \mathrm{Det}(S_n) = \mathrm{Det}\!\left(S_0\,\frac{\kappa_0+1}{\kappa_0(\nu-p+1)}\right) \end{equation}\)
\(\begin{equation}  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \  = \mathrm{Det}\!\left(S_0\,\frac{\kappa_0}{\kappa_0(\nu-p+1)}+S_0\,\frac{1}{\kappa_0(\nu-p+1)}\right) \end{equation}\)
\(\begin{equation}  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \  = \mathrm{Det}\!\left(S_0\,\frac{1}{\nu_n}+S_0\,\frac{1}{\kappa_0\nu_n}\right) \end{equation}\)
\(\begin{equation}  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \  = \mathrm{Det}\!\left(S_0\,\frac{1}{\nu_n}\left(1+\frac{1}{\kappa_0}\right)\right),  \ \ \ \kappa_0=1 \end{equation}\)
\(\begin{equation}  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \  = \mathrm{Det}(S_0\,\nu^{-1}D)  \end{equation}\)
So the reduced formula is the following:


\(\begin{equation} \left(1+\frac{(x-\mu_0)^T S^{-1}(x-\mu_0)}{\nu}\right)^{-\frac{\nu+D}{2}} \frac{\Gamma(\frac{\nu+D}{2})}{\Gamma(\frac{\nu}{2})}\,\mathrm{Det}(\nu D S)^{-\frac{1}{2}} \end{equation}\)


