To be able to make the most of the wealth of cosmological information available via observations of the large scale structure of the universe it is vital to have a strong model of how observable objects such as galaxies trace the underlying dark matter. In this work we used a neural bias model: a physically motivated neural network from which we can infer the halo mass distribution function. This function describes the abundance of halos with a certain mass given a dark matter density environment, where the halos are compact dark matter objects in which galaxies are hosted. As such, the neural bias model gives us a strong, but agnostic, bias model mapping the dark matter density field to (tracers of) the observable universe. Such a neural bias model can be included in the BORG inference scheme such that the initial conditions of the dark matter density and the parameters of the neural bias model are sampled using Hamiltonian Monte Carlo.

# Halo mass distribution function

The halo mass distribution function describes the number of dark matter halos at a certain mass given a dark matter density environment. It has been well studied in the past, and as such we know the approximate form of the function, which is described by the Press Schechter formalism which is a power law at small masses with an exponential cut off at high masses. There are less well understood elements also, including how the non-local density environment affects the abundance of halos and the form of the stochasticity from which halos are drawn from the halo mass distribution function. This stochasticity describes how one obtains the actual number of observed halos of a certain mass given that the halo mass distribution function only describes the probability of observing such a halo. The sampling of halos from the halo mass distribution function is normally assumed to be Poissonian, but this is known to be insufficient. Whilst we consider a Poissonian likelihood in this work, it should be noted that it is Poisson for a field of summaries provided by a neural physical engine and so includes information from the local surrounding region.

# Zero-shot training, Bayesian neural networks

The neural network used in this work is not pre-trained and is conditioned on the observed data only, in this case a halo catalogue obtained from a high resolution dark matter simulation. Zero-shot training describes a method of fitting a function without any training data. Several components are necessary to be able to achieve such a fitting of the neural bias model introduced here. These are: basing the design of the architecture of the network on physical principles; using appropriate functions to model the form of the halo mass distribution function; and finding a stable sampling procedure to obtain parameter samples from the posterior.

## Neural physical engines

Neural physical engines are simply neural networks that are built using physical principles. For example, with a physical model of how some data is distributed according to the parameters of a model, one builds a neural network with the symmetries of such a model built into its architecture. This is particularly useful for several reasons. Primarily, such a neural physical engine is massively protected from overfitting. Overfitting is prevented because only relevant information for the problem in hand is allowed to be fitted, and the network is insensative to spurious features of the data, such as noise. An added benefit to these networks is the massive reduction in the number of parameters necessary to fit the required function. This improves the computational efficiency of the algorithm, decreases training times and increases the interpretability of the network.

*The neural physical engine is a physically motivated neural network which maps a dark matter density distribution, evolved by Lagrangian perturbation theory, to a set of summaries which are informative about the abundance of halos of a certain mass on the grid.*

When building the neural bias model we construct a neural physical engine which takes a small patch of the gridded dark matter density field evolved from the initial conditions to today using Lagrangian perturbation theory as an input and outputs a single informative summary per voxel about the abundance of halos with a certain mass at that patch of the dark matter density field. We know that the halo mass distribution function is only sensitive to local information, and at the resolution we are working at, mostly due to the amplitude of the dark matter density field rather than the exact position of structures such as filaments or nodes in the dark matter field. We also know that the data is distributed evenly across the volume, i.e. there is translational and rotational invariance in the dark matter density field. This encourages us to use parameterised three-dimensional convolutional kernels with an extent which is only as large as the relevant scales and where the parameters are shared within the kernels according to a radial symmetry.

*The convolutional kernels used in neural networks are discrete and gridded, with each element of the array being an independent trainable parameter.
We introduce a method by which we can expand the kernels in terms of multipoles by associating weights at equal distances (and at given rotational angles) from the centre of the kernel.
Take for example a 3x3x3 convolutional kernel.
Normally this would have 27 free parameters.
By looking at the radially symmetric kernel, i.e. ℓ=0, each corner has an associated weight, as does each edge and each face and there is a single weight for the central element, equating to a total of 4 free parameters.
Then in the case of the dipolar kernel, i.e. ℓ=1, there are three independent kernels each with 3 parameters, making a total of 9.
For ℓ=2, there are now 5 independent kernels with 2 parameters each and including ℓ=3 saturates the freedom of the convolutional kernel and so no further multipoles are needed to fully parameterise the general kernel.
We can use this expansion to either reduce the number of parameters necessary by truncating in multipoles, or we can learn more about the informational content of the data in terms of expansion in multipoles.
In the second case, once trained, one can look at the response of the data in independent multipole paths, the larger the response the more informative that multipole is about the roll of the data in the neural network.
The code for producing the multipole kernels can be found at github:multipole_kernels.*

*The size of the convolutional kernel used is extremely important for a neural physical engine.
The size of the kernel is known as the receptive field, and dictates the size of the correlations which can be learned by the neural network. The receptive field should be chosen based on the data. If it is too small then it is impossible to learn about relevant features in the data and will tend to average out even the small scale features since it cannot distiguish the large scale modes. Likewise, if the receptive field is too large then the kernel will be massively overparameterised which can lead to overfitting and the fitting of spurious large scale features of the data. Since these large scale features are less common they are therefore less likely to be averaged out during training.This leads to a network which is difficult to train and has a much larger computational cost.
It should be noted that stacking convolutions leads to a larger receptive field throughout the network, but does not protect one from the above problems. The kernel size should be chosen carefully at each layer make the most of the distribution of information at each layer independently (this can be very tricky to do).*

## Neural density estimators

Since we wish to model the halo mass distribution function we need to consider an architecture whose output is a function (or at least an evaluation of the function). To do so we use a modified mixture density network which is a type of neural density estimator. Neural density estimators are neural networks whose outputs are samples from a fitted probabililty distribution function. For the halo mass distribution function we use a mixture of two Gaussian distributions where we allow the predicted amplitudes to be free positive parameters but organise the predicted mean parameters in order of magnitude. This breaks the degeneracy between the two Gaussians and allows us to have a smooth function whose amplitude can accurately approximate the abundance of halos.

*A mixture density network is a neural network which maps an input to a set of parameters for a collection of probability distributions. For example, one can predict the means, μ, standard deviations, σ, and amplitudes, α, of several Gaussian distributions and sum these Gaussians together. Provided that the amplitudes sum to 1, the mixture density will remain correctly normalised to be interpreted as a probability distribution. The mixture density network can then be trained by evalutating the value of the distribution at the labels for the input data and minimising the negative logarithm of the distribution.*

## Likelihood, \(\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT})\)

To fit the halo mass distribution to the halo catalogue used in this work we consider a Poisson likelihood. If our evolved dark matter density field, \(\boldsymbol{\delta}_\textsf{LPT}\), is passed through the neural physical engine, with parameters \(\boldsymbol{\theta}_\textsf{NPE}\), to get a field of summaries, \(\boldsymbol{\psi}_\textsf{NPE} = \boldsymbol{\psi}_\textsf{NPE}(\boldsymbol{\delta}_\textsf{LPT}, \boldsymbol{\theta}_\textsf{NPE})\), our halo mass distribution function is given by

\[n(M|\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_\textsf{MDN})= \sum_{i=1,2} \alpha(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)\mathcal{N}(M| μ(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i),σ(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)),\]where \(\mathcal{N}(M|μ, σ)\) is the value of a Gaussian with mean \(\mu\) and standard deviation \(\sigma\) evaluated at halo mass \(\textsf{log}(M)\). The Poisson likelihood can be written as two terms. The first term evaluates the neural halo mass distribution function for every halo in the catalogue, where the density environment is obtained from the patch of \(δ_\textsf{LPT}\) around each voxel index corresponding to each halo. This term therefore fits the abundance scale due to the catalogue. The second term is the integral over halo mass of the whole function for the entire evolved density field and therefore fits the shape of the function.

Note that by using this likelihood we never have to explicitly make a stochastic sampling of the halos to compare to the catalogue, although we could use the fitted halo mass distribution function to generate halo catalogues by using the value of the evaluated neural bias model as the rate parameter for Poisson sampling.

We will also include a Gaussian prior, \(\pi(\boldsymbol{\theta})\), on all the parameters of the neural bias model. We ensure that these weights and biases are centred on zero by rescaling them using prior knowledge of the amplitude of the abundance measured from the halo catalogue and the halo mass threshhold. Since the parameters of the neural bias model are centred on zero, we just need to a width to the Gaussian prior which is large enough to allow for parameter exploration, but tight enough to make sampling the parameters feasible.

## HMCLET

To be able to sample the weights of the neural bias model we use a modified Hamiltonian Monte Carlo. Hamiltonian Monte Carlo is a way of efficiently drawing samples from extremely large dimensional likelihood distributions. One starts with an initial set of neural bias model parameters, \(\boldsymbol{\theta}_0\), and proposes a new set, \(\boldsymbol{\theta}^*\), given a momentum, \({\bf p}\), drawn from a proposal distribution, \({\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})\). M is a mass matrix which describes the time scale along the parameter direction and correlation between the parameters. One then solves Hamilton’s equations, \(d\boldsymbol{\theta}/dt = {\bf M}^{-1}{\bf p}\) and \(d{\bf p}/dt = -\nabla \mathcal{V}(\boldsymbol{\theta})\) where the Hamiltonian is described by \(\mathcal{H}(\boldsymbol{\theta}, {\bf p}) = \mathcal{V}(\boldsymbol{\theta}) + \mathcal{K}(\boldsymbol{p})\), with \(\mathcal{V}(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}|\delta_\textsf{LPT}) + \pi(\boldsymbol{\theta})\) as the potential energy formed from the likelihood and the prior and \(\mathcal{K}(\boldsymbol{p}) = -{\bf p}^\textsf{T}{\bf M}^{-1}{\bf p}\) as a kinetic energy. Proposed parameters are then excepted according to a probablity given by \(\alpha = \textsf{Min}[\textsf{exp}(\Delta\mathcal{H}), 1]\), where \(\Delta\mathcal{H}\) is the difference between the energy at the proposed parameter values and the current parameter values. By conserving energy, one ensures that all proposals are accepted. It is ususal to use a symplectic integration scheme, such as the leapfrog algorithm (ϵ-discretisation) to solve these ODEs.

*The leapfrog algorithm involves drawing a momentum from a proposal distribution, \({\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})\), and taking a step of size \(\epsilon\) from the initial parameter positions \(\boldsymbol{\theta}_0\) according to \({\bf p} = {\bf p} - \epsilon\nabla \mathcal{V}(\boldsymbol{\theta}_0)/2\) giving \(\boldsymbol{\theta}_\textsf{next} = \boldsymbol{\theta}_0+\epsilon{\bf M}^{-1}{\bf p}\). This makes up the first half step in the leapfrog. The same procedure of updating \({\bf p}\) and \(\boldsymbol{\theta}\) occurs N number of steps, where the rest of the steps are full (\({\bf p} = {\bf p}-\epsilon\nabla \mathcal{V}(\boldsymbol{\theta})\)). The last half step is then taken. The choice of ϵ dictates the accuracy of the integration. If \(\epsilon\) is large then Hamilton’s equations are solved more inaccurately which can lead to energy loss between the initial and proposed parameters, which increases the rejection. On the other hand, if \(\epsilon\) is small then more samples are accepted since there is less (or less likely to be) energy loss, but this comes at a higher computational cost.*

Since neural networks are complex and in general have a large number of highly somewhat-degenerate parameters, it is very difficult to know the mass matrix *a priori*.
This means that extremely large steps can be made along the likelihood surface leading to numerical stability issues and improper sampling.
To overcome this, we can consider using the second order geometric information of the likelihood surface by calculating its Hessian using quasi-Newtonian methods.

*The Hessian (\({\bf B}\)), i.e. the second order gradient, of the likelihood surface can be calculated using quasi-Newtonian methods. Quasi-Newtonian methods are root-finding algorithms where the Hessian (or Jacobian) are approximated. There are many ways to calculate the approximate Hessian, we use the BFGS method in this work. This method is convenient since it can be calculated for free as part of the leapfrog algorithm. When using the second order geometric information the ODEs become \(d\boldsymbol{\theta}/dt = {\bf B}{\bf M}^{-1}{\bf p}\) and \(d{\bf p}/dt = -{\bf B}\nabla \mathcal{V}(\boldsymbol{\theta})\). This means that, although the mass matrix is still needed to set the time scales along the parameter directions, the momenta get effectively rescaled by the Hessian, breaking parameter degeneracies and allowing for an efficient acceptance ratio.*

# Results

With a neural bias model formed of a neural physical engine which is sensitive to non-local radial information, a neural density estimator to give us evaluations of suffciently arbitrary functions and a sampling scheme which can effectively explore the complex likelihood landscape we can now infer the halo mass distribution function.

*The BORG algorithm infers the initial conditions of the dark matter distribution. First the initial conditions are drawn from a prior given a cosmology to generate an initial dark matter density field. In this work, this dark matter density field is then evolved forward using Lagrangian perturbation theory to obtain the dark matter density field today. This is then passed through the neural physical engine to obtain an informative field of summaries about the abundance of dark matter halos on a grid. This can then be compared to the observed halo catalogue via the Poissonian likelihood between the halo mass distribution function provided by the neural density estimator of the neural bias model. Evaluating this likelihood allows us to obtain posterior samples of all of the initial phases of the dark matter density distribution and all of the parameters of the neural bias model.*

We use a halo catalogue constructed using Rockstar from a chunk of the VELMASS Ω dark matter simulation, which has a Planck-like cosmology. This catalogue has about 10,000 halos with a mass threshhold of 2x10^{12} solar masses.

As shown in the figures below, we are able to fit the halo mass distribution function extremely well, with sampling around the observed catalogue. Furthermore, the information used comes from the non-local region around the each voxel in the gridded density field, showing that the surrounding area holds information about the abundance of halos.

*The abundance of halos at a certain mass given a density environment from the VELMASS halo catalogue is plotted using the diamonds with dashed lines.
The more dense the environment, the more halos are expected at all masses.
The solid lines are the mean halo mass distribution function values from the neural bias model.
The filled areas are the 1σ intervals either side of the mean obtained by the samples from the Markov chain.
We can see that the fit is very good (even with the very simple model considered here), and that the shape of the function changes with density environment.
This shows that the neural bias model is able to account for the response of the density field.*

*Here we see an example of an initial density field and the same field evolved using Lagrangian perturbation theory on the top row.
The bottom row shows the effect of the neural physical engine which provides an enhancement in constrast, which is a more informative summary of the abundance of halos than the LPT field.
This is because non-local information is gathered from the surrounding voxels by the neural physical engine.
The last box (bottom right) is the true halos from the VELMASS halo catalogue placed onto the same grid.
Note that the NPE field does not look like the halo distribution since a Poisson sampling of the halo mass distribution function is needed to get a stochastic realisation of the halo distribution.*

# Future work

The methods presented in this paper show a state of the art in terms of machine learning as well as new methods for dealing with the bias model in BORG and for generating halo catalogues from the neural bias model. We will continue our work in two main directions. The first is to look at bypassing the halos completely by learning the form of the likelihood using some form of neural density estimation (or neural flow) which would allow us to be more agnostic about the form of the likelihood. This would mean that we could, in principle, marginalise out the effect of the ambiguity in the likelihood to provide robust constraints on the initial density phases and cosmology. The second is to use architecture optimisation schemes to find a better fit to the halo mass distribution function for use in halo catalogue generation.

# References

- Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, Supranta Sarma Boruah, Jens Jasche, Michael J. Hudson, 2019, submitted to MNRAS, arXiv:1909.06379

Authored by T. Charnock

Post identifier: /method/machine%20learning/npe