The Aquila consortium
The Aquila consortium aims at understanding the Universe.
https://www.aquila-consortium.org/
Mon, 25 May 2020 21:06:57 +0200
Simulating the Universe on a mobile phone
<h1 id="overview">Overview</h1>
<p>There are about two trillion galaxies in the observable Universe, and the evolution of each of them is sensitive to the presence of all the others. Can we put this all into a computer, or even a mobile phone, to simulate the evolution of the Universe? In a recent paper, we introduced a perfectly parallel algorithm for cosmological simulations which addresses this question.</p>
<p>Modern cosmology relies on very large data sets to determine the content of our Universe, in particular the amounts of dark matter and dark energy. These large datasets include the positions and electromagnetic spectra of very distant galaxies, up to 20 billion light-years away. In the next decade, the Euclid mission<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> and the Vera Rubin observatory,<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> in particular, will obtain information on several billion galaxies.</p>
<h1 id="physical-challenges">Physical challenges</h1>
<p>Making the link between our knowledge of physics, for example the equations that govern the evolution of dark matter and dark energy, and astronomical observations requires considerable computational resources. Indeed, the most recent observations cover huge volumes: of the order of that of a cube of 12 billion light-years side length. As the typical distance between two galaxies is only a few million light-years, we have to simulate around one trillion galaxies to reproduce the observations.</p>
<p>In addition, in order to follow the physics of the formation of these galaxies, the spatial resolution should be of the order of ten light-years. Ideally, simulations should therefore have a scale ratio (that is, the ratio between the largest and smallest physical lengths of the problem) close to a billion. No computer, existing or even under construction, can achieve such a goal.</p>
<p>In practice, it is therefore necessary to use approximate techniques, consisting in “populating” the large-scale structures of the Universe with fictitious (but realistic) galaxies. This approximation is further justified by the fact that the evolution of galaxies’ components, for example stars and interstellar gas, involves very fast phenomena in comparison to the global evolution of the cosmos. The use of fictitious galaxies still requires simulating the dynamics of the Universe with a scale ratio of around 4,000, which is just possible with today’s supercomputers.</p>
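<p>The two scale ratios quoted above follow from simple divisions. The short sketch below reproduces them; it assumes that “a few million light-years” means 3 million, which is an illustrative choice rather than a value given in the text.</p>

```python
# Scale ratio = largest physical length / smallest physical length of the problem.
BOX_SIDE_LY = 12e9              # side of the observed volume (from the text)
GALAXY_FORMATION_RES_LY = 10.0  # resolution needed to follow galaxy formation (from the text)
GALAXY_SEPARATION_LY = 3e6      # ASSUMPTION: "a few million light-years" taken as 3 million

ideal_ratio = BOX_SIDE_LY / GALAXY_FORMATION_RES_LY
approximate_ratio = BOX_SIDE_LY / GALAXY_SEPARATION_LY

print(f"ideal scale ratio:       {ideal_ratio:.1e}")
print(f"approximate scale ratio: {approximate_ratio:.0f}")
```

<p>Under this assumption the first ratio is close to a billion and the second is about 4,000, matching the orders of magnitude discussed above.</p>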
<h1 id="the-problem-of-computational-limits">The problem of computational limits</h1>
<p>Simulating the gravitational dynamics of the Universe is what physicists call an <script type="math/tex">N</script>-body problem. Although the equations to be solved are analytical, as in most cases in physics, solutions have no simple expressions and require numerical techniques as soon as <script type="math/tex">N</script> is larger than two. The direct numerical solution consists in explicitly calculating the interactions between all the pairs of bodies, also called “particles”. The computation of forces by direct summation was the favoured technique in cosmology at the beginning of the development of numerical simulations, in the 1970s. At present, it is mainly used for simulations of star clusters and galactic centres. The number of particles used in “direct summation” simulations is represented by green dots in figure 1, where the <script type="math/tex">y</script>-axis has a logarithmic scale.</p>
<p class="figure wide"><img src="/assets/posts/scola/Moore_law_cosmosims.png" alt="Number of particles in cosmological simulations as a function of time" />
<em>Evolution of the number of particles used in <script type="math/tex">N</script>-body simulations as a function of year of publication. Different symbols and colours correspond to different methods used to compute gravitational dynamics (direct summation in green, advanced algorithms in orange). For comparison, Moore’s law concerning computer performance is represented by the black dotted line.</em></p>
<p>The direct summation method has a numerical cost which increases like <script type="math/tex">N^2</script>, the number of pairs of particles considered. For this reason, in spite of improvements provided by hardware accelerators such as graphics processing units (GPUs), the number of particles used with this method cannot grow as quickly as predicted by the famous “Moore’s law”, which predicts a doubling of computer hardware performance every 18 months. Moore’s law was verified for about four decades (1965-2005), but as traditional hardware architectures are reaching their physical limit, the performance of individual compute cores reached a plateau around 2015 (see figure 2). Therefore, cosmological simulations cannot merely rely on processors becoming faster to reduce the computational time.</p>
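<p>To make the <script type="math/tex">N^2</script> cost concrete, here is a minimal sketch of direct summation with NumPy. The function name, the unit gravitational constant and the small softening length are illustrative assumptions, not taken from any particular simulation code.</p>

```python
import numpy as np


def direct_accelerations(pos, mass, G=1.0, softening=1e-3):
    """Gravitational accelerations by direct summation over all pairs.

    pos has shape (N, 3) and mass has shape (N,). G and softening are
    illustrative assumptions; the softening avoids division by zero.
    """
    # diff[i, j] is the vector pointing from particle i to particle j
    diff = pos[None, :, :] - pos[:, None, :]
    dist2 = np.sum(diff**2, axis=-1) + softening**2
    inv_dist3 = dist2**-1.5
    np.fill_diagonal(inv_dist3, 0.0)  # a particle exerts no force on itself
    # a_i = G * sum_j m_j (x_j - x_i) / |x_j - x_i|^3
    return G * np.einsum("ij,j,ijk->ik", inv_dist3, mass, diff)


# Two unit masses separated by one unit of length attract each other.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
mass = np.ones(2)
acc = direct_accelerations(pos, mass)
print(acc)
```

<p>The intermediate arrays have <script type="math/tex">N\times N</script> entries, so both memory and run time grow quadratically with the number of particles: doubling <script type="math/tex">N</script> quadruples the work.</p>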
<p class="figure wide"><img src="/assets/posts/scola/Moore_law_processors.png" alt="Single-threaded floating point performance as a function of time" />
<em>Single-threaded floating point performance of CPUs as a function of time. Different trademarks and models are represented by different colours and symbols as indicated in the legend. This plot is based on adjusted SPECfp® results.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></em></p>
<p>In order to reduce the cost of simulations, most of the work in numerical cosmology since 1980 has consisted in improving algorithms. The aim was to circumvent the explicit calculation of all gravitational interactions between particles, especially for pairs which are the most distant in the volume to be simulated. These algorithmic developments have enabled a huge increase in the number of particles used in cosmological simulations (see the orange triangles in figure 1). In fact, since 1990, the increase in computational capacity in cosmology has been faster than Moore’s Law, with software improvements adding to the increase in computer performance (more details in <a href="http://florent-leclercq.eu/blog.php?page=2">this blog post</a>).</p>
<p>In 2020, with the architectures of modern supercomputers, calculations are no longer limited by the number of operations that processors can perform in a given time, but by the inherent latencies in communications among the different processors involved in so-called “parallel” calculations. In these computational techniques, a large number of processors work together synchronously to perform calculations far too complex to be carried out on a conventional computer. The stagnation of performance due to communication latencies has been theorised in “Amdahl’s law” (see figure 3), named after the computer scientist who formulated it in 1967. It is now the main challenge for cosmological simulations: without improving the “degree of parallelism” of our algorithms, we will soon reach a technological plateau.</p>
<p class="figure wide"><img src="/assets/posts/scola/Amdahl_law.png" alt="Amdahl’s law" />
<em>Amdahl’s law: theoretical speed-up in the execution of a program as a function of the number of processors executing it, for different values of the parallel fraction of the program (different lines). The speed-up is limited by the serial part of the program. For example, if 90% of the program can be parallelised, the theoretical maximum speed-up factor using a large number of processors would be 10.</em></p>
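<p>Amdahl’s law can be written in one line: if a fraction <script type="math/tex">p</script> of a program can be parallelised over <script type="math/tex">n</script> processors, the speed-up is <script type="math/tex">1/[(1-p)+p/n]</script>. The sketch below evaluates it; the function name is an illustrative choice.</p>

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Theoretical speed-up of a program according to Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


for n in (1, 10, 100, 10_000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

<p>With 90% of the program parallelised, the speed-up approaches but never exceeds 10, however many processors are used. Only a parallel fraction of exactly 1, the “perfectly parallel” case, scales linearly with the number of processors.</p>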
<h1 id="the-scola-approach-divide-and-conquer">The sCOLA approach: divide and conquer</h1>
<p>Let us go back to the physical problem to be solved: it is about simulating the gravitational dynamics of the Universe at different scales. At “small” scales, there are many objects that interact with each other: numerical simulations are required. But at “large” spatial scales, that is to say if we look at figure 4 from very far, not much happens during evolution (except for a linear increase of the amplitude of inhomogeneities). Despite this, with traditional simulation algorithms, the gravitational effect of all the particles on each other must be calculated, even if they are very far apart. It is expensive and almost useless, since most of gravitational evolution is correctly described by simple equations, which can be solved analytically without a computer.</p>
<p class="figure wide"><img src="/assets/posts/scola/scola_comparison.png" alt="Comparison between traditional and sCOLA simulations" />
<em>Comparison between a traditional simulation (left panel) and a simulation using our new algorithm (right panel). In our approach, the volume of the simulation is a mosaic made of “tiles” calculated independently and whose edges are represented by dotted lines.</em></p>
<p>In order to minimise unnecessary numerical calculations, it is possible to use a hybrid simulation algorithm: analytical at large scales and numerical at small scales. The underlying idea, called spatial comoving Lagrangian acceleration (sCOLA<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>), is common in physics: it is a “change of frame of reference”. In this framework, large-scale dynamics is taken into account by the new frame of reference, while small-scale dynamics is solved numerically by the computer, using conventional calculations of the gravity field. Unfortunately, the most naive version of the sCOLA algorithm gives results that are too approximate to be usable. In our last publication,<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup> we modified sCOLA in order to improve its accuracy.</p>
<p>Furthermore, we have realised that this concept makes it possible to “divide and conquer”. Indeed, given a large volume to be simulated, sCOLA allows sub-volumes of smaller size to be simulated independently, without communication with neighbouring sub-volumes. Our approach therefore makes it possible to represent the Universe as a large mosaic: each of the “tiles” in figure 4 is a small simulation that a modest computer can solve, and the assembly of all the tiles gives the overall picture. This is what is called in computer science a “perfectly parallel” algorithm, unlike all cosmological simulation algorithms so far. Thanks to it, we have been able to obtain cosmological simulations at a satisfactory resolution, while remaining on a relatively modest computing facility (figure 5).</p>
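<p>The structure of such a “mosaic” computation can be sketched in a few lines. This is only a toy illustration of the perfectly parallel pattern: <code>evolve_tile</code> is a hypothetical placeholder for the per-tile solver, all function names are assumptions, and the sketch ignores any treatment of tile boundaries. It is not the sCOLA algorithm or the Simbelmynë interface.</p>

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def tile_slices(n_grid, tiles_per_side):
    """Return the index slices of every tile of a cubic grid."""
    edges = np.linspace(0, n_grid, tiles_per_side + 1).astype(int)
    ranges = [slice(a, b) for a, b in zip(edges[:-1], edges[1:])]
    return list(itertools.product(ranges, repeat=3))


def evolve_tile(tile):
    """HYPOTHETICAL stand-in for the small-scale solver run on one tile."""
    return np.log1p(tile)  # any operation that only needs the tile's own data


def evolve_mosaic(field, tiles_per_side, max_workers=4):
    """Evolve every tile independently, then assemble the full volume."""
    slices = tile_slices(field.shape[0], tiles_per_side)
    tiles = [field[s].copy() for s in slices]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evolve_tile, tiles))  # no tile talks to another
    out = np.empty_like(field)
    for s, result in zip(slices, results):
        out[s] = result
    return out


rng = np.random.default_rng(0)
field = rng.random((8, 8, 8))
mosaic = evolve_mosaic(field, tiles_per_side=2)
print(mosaic.shape)
```

<p>Because no tile needs data from its neighbours while it is being computed, the workers could equally be separate machines: each one receives a tile, computes, and sends back its result.</p>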
<p>Our perfectly parallel sCOLA algorithm has been implemented in the publicly available <strong>Simbelmynë</strong> code,<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> where it is included in version 0.4.0 and later.</p>
<p class="figure"><img src="/assets/posts/scola/horizon_cluster.png" alt="A GPU-based computer" />
<em>A GPU-based computer at the Institut d’Astrophysique de Paris. Its cost represents only a hundredth of that of a supercomputer at national computing facilities.</em></p>
<h1 id="new-hardware-to-simulate-the-universe">New hardware to simulate the Universe</h1>
<p>This new algorithm is not limited to small computing facilities: it also makes it possible to envisage new ways of exploiting computing hardware. Ideally, each of the “tiles” could be small enough to fit in the “cache memory” of our computers, that is, the part of the memory that processors can access in the smallest amount of time. The resulting communication speed-up would allow us to simulate the entire volume of the Universe extremely quickly, or even at a resolution never achieved so far.</p>
<p>Going further, we can even imagine that each of the simulations corresponding to a “tile” would be small enough that it can be run on a modern mobile phone! This parallelisation technique would be based on a platform such as Cosmology@Home<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>, which is dedicated to distributed collaborative computing. This platform is derived from the efforts initiated by SETI@Home<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup> for the search for extraterrestrial intelligence.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://www.euclid-ec.org/">https://www.euclid-ec.org/</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://www.lsst.org/">https://www.lsst.org/</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="http://spec.org/">http://spec.org/</a> <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>S. Tassev, D. J. Eisenstein, B. D. Wandelt, M. Zaldarriaga, <em>sCOLA: The N-body COLA Method Extended to the Spatial Domain</em> (2015), <a href="https://arxiv.org/abs/1502.07751">arXiv:1502.07751</a> <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>F. Leclercq, B. Faure, G. Lavaux, B. D. Wandelt, A. H. Jaffe, A. F. Heavens, W. J. Percival, C. Noûs, <em>Perfectly parallel cosmological simulations using spatial comoving Lagrangian acceleration</em>, A&A, in press (2020), <a href="https://arxiv.org/abs/2003.04925">arXiv:2003.04925</a> <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>The Simbelmynë code: <a href="http://simbelmyne.florent-leclercq.eu">homepage</a> <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p><a href="https://www.cosmologyathome.org/">https://www.cosmologyathome.org/</a> <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p><a href="https://setiathome.berkeley.edu/">https://setiathome.berkeley.edu/</a> <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 25 May 2020 00:00:00 +0200
https://www.aquila-consortium.org/method/scola.html
Why neural networks don’t work and how to use them
<h1 id="neural-networks-as-universal-model-approximators">Neural networks as universal model approximators</h1>
<p>We can think of a neural network, <script type="math/tex">\mathbb{NN}(\boldsymbol{w}, \boldsymbol{\alpha}) : {\bf d}\to\boldsymbol{\tau}</script>, as an approximation of a model, <script type="math/tex">\mathcal{M} : {\bf d}\to{\bf t}</script>, where <script type="math/tex">{\bf d}</script> is some input data to the network and the output of the network is <script type="math/tex">\boldsymbol{\tau}</script> which is an estimate of some target, <script type="math/tex">{\bf t}</script>, associated with the data. The neural network itself is a function of some trainable parameters called weights, <script type="math/tex">\boldsymbol{w}</script>, and some hyperparameters, <script type="math/tex">\boldsymbol{\alpha}</script>, which encompass the architecture of the network, the initial values of the weights, the form of activation functions, the choice of cost function, etc.</p>
<h1 id="likelihood-of-obtaining-targets-given-a-network">Likelihood of obtaining targets given a network</h1>
<p>In a traditional sense, the training of a neural network is equivalent to minimising a <em>cost</em> or <em>loss</em> function, <script type="math/tex">\Lambda({\bf t}, \boldsymbol{\tau})</script>, with respect to the weights of the network, <script type="math/tex">\boldsymbol{w}</script> (and hyperparameters, <script type="math/tex">\boldsymbol{\alpha}</script>) given a set of pairs of data and targets for training and validation, <script type="math/tex">\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}</script> and <script type="math/tex">\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}</script>. The cost function, <script type="math/tex">\Lambda({\bf t}, \boldsymbol{\tau})</script>, measures how close the outputs of a fixed network, <script type="math/tex">\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to\boldsymbol{\tau}</script>, are to some target, <script type="math/tex">{\bf t}</script>, given a data-target pair, <script type="math/tex">\{ {\bf d}, {\bf t}\}</script>, at some fixed network parameters and hyperparameters, <script type="math/tex">\boldsymbol{w}=\boldsymbol{w}^*</script> and <script type="math/tex">\boldsymbol{\alpha}=\boldsymbol{\alpha}^*</script>. That is, how likely is it that the output of the network provides the true target for the input data given a chosen set of weights and fixed network hyperparameters, i.e. the cost function is equivalent to the (negative logarithm of the) likelihood function</p>
<script type="math/tex; mode=display">\Lambda({\bf t}, \boldsymbol{\tau})\simeq-\textrm{ln}\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w}^*,\boldsymbol{\alpha}^*).</script>
<p class="figure"><img src="/assets/posts/nn/likelihood.svg" alt="Wibbly likelihood surface" />
<em>The likelihood surface, although regular for a given set of network parameters and hyperparameters, is extremely complex, degenerate, and even discrete and non-convex in the directions of the network parameters and hyperparameters.</em></p>
<p>Although the cost function is normally chosen to be convex, i.e. with a global minimum and defined everywhere, at a given value of <script type="math/tex">\boldsymbol{w}=\boldsymbol{w}^*</script> and <script type="math/tex">\boldsymbol{\alpha}=\boldsymbol{\alpha}^*</script>, the shape of the likelihood is extremely complex, degenerate and bumpy when considering all possible <script type="math/tex">\boldsymbol{w}</script> and will often be discrete and non-convex in the <script type="math/tex">\boldsymbol{\alpha}</script> direction.</p>
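<p>As a concrete instance of the correspondence between cost and likelihood, the sketch below compares a squared-error cost with the negative logarithm of a Gaussian likelihood. The Gaussian form and the unit width are assumptions made for illustration; other cost functions correspond to other likelihoods.</p>

```python
import numpy as np


def squared_error_cost(t, tau):
    """Squared-error cost between targets t and network outputs tau."""
    return 0.5 * np.sum((t - tau) ** 2)


def gaussian_neg_log_likelihood(t, tau, sigma=1.0):
    """Negative log of a Gaussian likelihood of width sigma (ASSUMED form)."""
    n = t.size
    return 0.5 * np.sum(((t - tau) / sigma) ** 2) + n * np.log(sigma * np.sqrt(2.0 * np.pi))


t = np.array([0.2, -1.0, 3.5])
for tau in (np.zeros(3), np.array([0.1, -0.8, 3.0])):
    difference = gaussian_neg_log_likelihood(t, tau) - squared_error_cost(t, tau)
    print(difference)
```

<p>The difference is the same constant for every output <script type="math/tex">\boldsymbol{\tau}</script>, so minimising the squared-error cost and maximising this Gaussian likelihood select the same weights.</p>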
<h2 id="maximum-likelihood-network-parameter-estimates">Maximum likelihood network parameter estimates</h2>
<p>The normal procedure for using neural networks is to <em>train</em> them. This means finding the maximum likelihood estimates of the weights of a network with a given set of training data-target pairs <script type="math/tex">\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}</script> and fixed hyperparameters, <script type="math/tex">\boldsymbol{\alpha}=\boldsymbol{\alpha}^*</script>, by doing</p>
<script type="math/tex; mode=display">\boldsymbol{w}^\textrm{MLE}=\underset{\boldsymbol{w}}{\textrm{argmax} }\left[\mathcal{L}(\{ {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\vert \{ {\bf d}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}, \boldsymbol{w}, \boldsymbol{\alpha}^*)\right].</script>
<p>That is, find the set of <script type="math/tex">\boldsymbol{w}</script> for which the likelihood function evaluated at every member in the training set is maximal. In the case that the pairs of data and targets, <script type="math/tex">\{ {\bf d}, {\bf t}\}</script>, are independent and identically distributed, we can write the likelihood as</p>
<script type="math/tex; mode=display">\mathcal{L}(\{ {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\vert \{ {\bf d}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}, \boldsymbol{w}, \boldsymbol{\alpha}^*)=\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w},\boldsymbol{\alpha}^*).</script>
<p class="figure"><img src="/assets/posts/nn/mle.gif" alt="Stochastic gradient descent" />
<em>By finding the set of <script type="math/tex">\boldsymbol{\tau}</script> which are closest (in the sense of the minimum cost function) to the target <script type="math/tex">{\bf t}</script>, given a neural network and some input data <script type="math/tex">{\bf d}</script>, the weights of the network traverse the negative logarithm of the likelihood surface for the true target, hopefully ending at some minimum (which is a maximum in the likelihood).</em></p>
<p>To find the maximum likelihood of the weights, one would normally consider some sort of stochastic gradient descent. Since most software is more efficient at finding minima rather than maxima, we actually minimise the negative logarithm of the likelihood, i.e. the cost function</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\boldsymbol{w}^\textrm{MLE}&=\underset{\boldsymbol{w}}{\textrm{argmin} }\left[\sum_i^{n_\textrm{train} }\Lambda({\bf t}^\textrm{train}_i, \boldsymbol{\tau}^\textrm{train}_i)\right]\\
&=\underset{\boldsymbol{w}}{\textrm{argmin} }\left[-\sum_i^{n_\textrm{train} }\textrm{ln}\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w}, \boldsymbol{\alpha}^*)\right].
\end{align} %]]></script>
<p>The weights are updated using <script type="math/tex">\boldsymbol{w}\to\boldsymbol{w}+\nabla_{\boldsymbol{w}} \sum_i^{n_\textrm{train} }\ \textrm{ln}\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w}, \boldsymbol{\alpha}^*)</script> (scaled in practice by a step size), which descends the negative logarithm of the likelihood. In the ideal case there would be one global maximum in the likelihood so that after training the value of the weights of the neural network would be equal to the maximum likelihood estimates, <script type="math/tex">\boldsymbol{w}=\boldsymbol{w}^\textrm{MLE}</script>. However, since the likelihood surface is, in reality, extremely degenerate and flat in the space of weight values, it is most likely that the weights only achieve a local maximum, i.e. <script type="math/tex">\boldsymbol{w}=\boldsymbol{w}^\textrm{local MLE}</script>. In fact, which local maximum is found will normally depend extremely strongly on the initial <script type="math/tex">\boldsymbol{w}=\boldsymbol{w}_\textrm{init}</script> which is used for the gradient descent.</p>
<p class="figure"><img src="/assets/posts/nn/w_init.gif" alt="Initialisation dependent gradient descent" />
<em>The initialisation of the weights will be very important in determining which local maximum likelihood estimate is found. This is because the surface of the likelihood is very bumpy. It can also be highly degenerate which leads to whole families of pseudo-maximum likelihood estimates.</em></p>
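<p>The dependence on initialisation can be reproduced with a single weight. In the sketch below, the cost function is a hypothetical negative log-likelihood with two minima, chosen only for illustration, and the step size and number of steps are arbitrary assumptions.</p>

```python
def cost(w):
    """Toy negative log-likelihood with two minima (HYPOTHETICAL, for illustration)."""
    return (w**2 - 1.0) ** 2 + 0.3 * w


def cost_gradient(w):
    """Derivative of the toy cost with respect to the weight."""
    return 4.0 * w * (w**2 - 1.0) + 0.3


def gradient_descent(w_init, step_size=0.01, n_steps=2000):
    """Plain gradient descent on the cost, starting from w_init."""
    w = w_init
    for _ in range(n_steps):
        w = w - step_size * cost_gradient(w)
    return w


w_left = gradient_descent(w_init=-0.5)
w_right = gradient_descent(w_init=0.5)
print(f"start -0.5: w = {w_left:.3f}, cost = {cost(w_left):.3f}")
print(f"start +0.5: w = {w_right:.3f}, cost = {cost(w_right):.3f}")
```

<p>Both runs converge to a point where the gradient vanishes, but only the run started at negative <script type="math/tex">w</script> reaches the lower of the two minima. The other run ends in a local minimum of the cost, i.e. a local maximum of the likelihood.</p>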
<p>Once the maximum (or at least local maximum) is found, it is normal to evaluate the accuracy (or some other figure of merit) using some validation set, <script type="math/tex">\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}</script>. This validation set is used to modify the hyperparameters, <script type="math/tex">\boldsymbol{\alpha}</script>, of the network to achieve the best possible fit to both the training and validation sets. These modifications could include changing the initial seeds of the weights, changing the activation functions, or changing the entire architecture, for example. However, networks trained in such a way do not provide a way to obtain scientifically robust estimates of the true targets <script type="math/tex">{\bf t}</script>, given observed data <script type="math/tex">{\bf d}</script>. To see why, we need to consider the probabilistic interpretation of neural networks.</p>
<h1 id="probabilistic-interpretation-of-neural-networks">Probabilistic interpretation of neural networks</h1>
<p>The posterior predictive density of obtaining a target, <script type="math/tex">{\bf t}</script>, given some input data, <script type="math/tex">{\bf d}</script>, is</p>
<script type="math/tex; mode=display">\mathcal{P}({\bf t}\vert {\bf d}) = \int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}).</script>
<p>Here, <script type="math/tex">\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w},\boldsymbol{\alpha})</script> is the likelihood of obtaining the true value of the target, i.e. the (unnormalised) exponential of the negative <em>cost</em> function, given some input data <script type="math/tex">{\bf d}</script> and network parameters and hyperparameters <script type="math/tex">\boldsymbol{w}</script> and <script type="math/tex">\boldsymbol{\alpha}</script>, and <script type="math/tex">\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})</script> is the probability of obtaining the weights and hyperparameters of the neural network. Since the likelihood of obtaining any value of the target, <script type="math/tex">{\bf t}</script>, given some input data, <script type="math/tex">{\bf d}</script>, is essentially equal for any given neural network, i.e. any combination of <script type="math/tex">\boldsymbol{w}</script> and <script type="math/tex">\boldsymbol{\alpha}</script>, the likelihood, <script type="math/tex">\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})</script>, is almost flat. Therefore, the majority of the information about the posterior predictive density, <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})</script>, comes from any <em>a priori</em> or <em>a posteriori</em> knowledge of the weights and hyperparameters, <script type="math/tex">\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})</script>, which therefore has to be chosen or found very carefully.</p>
<p class="figure"><img src="/assets/posts/nn/pp.gif" alt="Pointiness of posterior predictive density" />
<em>The form of the posterior predictive density of the targets <script type="math/tex">{\bf t}</script> depends mostly on the probability of the weights and hyperparameters of the network. This means that the prior for the weights and hyperparameters must be chosen carefully or the posterior extremely well characterised via training data.</em></p>
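<p>The integral defining the posterior predictive density can be estimated by Monte Carlo: draw weights from <script type="math/tex">\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})</script> and average the likelihood over the draws. The sketch below does this for a toy “network” with a single weight, <script type="math/tex">\tau = w\,d</script>, a Gaussian likelihood and Gaussian weight distributions; all of these choices and all numerical values are assumptions made for illustration.</p>

```python
import numpy as np


def gaussian_likelihood(t, tau, sigma):
    """Likelihood of target t given network output tau (Gaussian, width sigma)."""
    return np.exp(-0.5 * ((t - tau) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def posterior_predictive(t, d, weight_samples, sigma=0.5):
    """Monte Carlo estimate of P(t|d): average the likelihood over weight samples."""
    tau = weight_samples * d  # toy one-weight "network": tau = w * d
    return np.mean(gaussian_likelihood(t, tau, sigma))


rng = np.random.default_rng(0)
d = 2.0
broad_weights = rng.normal(loc=1.0, scale=0.3, size=200_000)    # weakly constrained P(w)
narrow_weights = rng.normal(loc=1.0, scale=0.01, size=200_000)  # well-characterised P(w)

for t in (2.0, 3.0):
    print(t, posterior_predictive(t, d, broad_weights), posterior_predictive(t, d, narrow_weights))
```

<p>With the same likelihood in both cases, the predictive density for <script type="math/tex">{\bf t}</script> is broad or sharply peaked depending only on the distribution of the weights, which is the behaviour described in the figure above.</p>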
<p>A Bayesian neural network is a network which provides the true posterior predictive density of targets <script type="math/tex">{\bf t}</script> given data <script type="math/tex">{\bf d}</script>.</p>
<h2 id="failure-of-traditionally-trained-neural-networks">Failure of traditionally trained neural networks</h2>
<p>As described above, given a set of training pairs, <script type="math/tex">\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}</script>, and validation pairs, <script type="math/tex">\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}</script>, we can find the (local) maximum likelihood estimates of the weights, <script type="math/tex">\boldsymbol{w}=\boldsymbol{w}^\textrm{local MLE}</script>, and optimise the hyperparameters to <script type="math/tex">\boldsymbol{\alpha}=\boldsymbol{\alpha}^*</script> which gives the best fit to both the training and validation data-target pair sets. Since we fix both the parameters and hyperparameters, those values are set in stone and we degenerate the posterior distribution to a Dirac <script type="math/tex">\delta</script> function, neglecting any information brought by the training data, i.e.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}|\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}) &\propto\mathcal{L}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})p(\boldsymbol{w},\boldsymbol{\alpha})\\
&\to\delta(\boldsymbol{w}-\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)
\end{align} %]]></script>
<p>where <script type="math/tex">p(\boldsymbol{w},\boldsymbol{\alpha})</script> is a prior distribution over the weights and hyperparameters.
By making such a choice, we erase the entirety of the information about the distribution of data and work only with the best fit model, which may (or may not) be complete.
As such, the predictive probability density of the targets <script type="math/tex">{\bf t}</script> given data <script type="math/tex">{\bf d}</script> is</p>
<script type="math/tex; mode=display">\mathcal{P}({\bf t}\vert {\bf d}) =\delta({\bf t}-\boldsymbol{\tau}({\bf d})),</script>
<p>i.e., the probability of obtaining an estimate from the network is zero everywhere apart from at the value of the output of the network, <script type="math/tex">\mathbb{NN}(\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}^*) : {\bf d}\to\boldsymbol{\tau}</script>: the function is completely deterministic. Effectively, this means that the probability of obtaining <script type="math/tex">{\bf t}</script> given the fixed network parameters and hyperparameters and some data <script type="math/tex">{\bf d}</script> is vanishingly small.</p>
<p>Consider a third <em>test</em> set, <script type="math/tex">\{ {\bf d}^\textrm{test}_i, {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}</script>. One normally determines how well a neural network is trained using this unseen (blind) set. To test the network, all of the test data, <script type="math/tex">\{ {\bf d}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}</script>, are passed through the network to get estimates <script type="math/tex">\{\boldsymbol{\tau}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}</script> which can be plotted against the known targets, <script type="math/tex">\{ {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}</script> (see the figure below).</p>
<p class="figure"><img src="/assets/posts/nn/nn_w.gif" alt="True vs. Predicted targets" />
<em>For any set of data, given a trained neural network with hyperparameters and network parameters fixed at their maximum likelihood values, the probability of obtaining a target is a <script type="math/tex">\delta</script> function. There is no knowledge of whether the output of the network will be equal to the target, and it is, in fact, extremely unlikely that they will be.</em></p>
<p>A network which produces <script type="math/tex">\{\boldsymbol{\tau}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}</script> which correlate very strongly with <script type="math/tex">\{ {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}</script> is probably a network that is in a very good local maximum for both the weights and the hyperparameters. However, there is no assurance that the true <script type="math/tex">{\bf t}</script> should be obtained by the network, and due to the complexity of the likelihood <script type="math/tex">\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})</script>, there is also no way of ensuring that <script type="math/tex">\boldsymbol{\tau}</script> should be similar to <script type="math/tex">{\bf t}</script>. Simply, for complex models, it is not possible to prove that the neural network is equivalent to the model, <script type="math/tex">\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}</script>, and so there is no guarantee that the network will provide <script type="math/tex">\boldsymbol{\tau}={\bf t}</script>. In fact, because <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})=\delta({\bf t}-\boldsymbol{\tau}({\bf d}))</script>, it is extremely unlikely to ever find <script type="math/tex">\boldsymbol{\tau}={\bf t}</script>. For extremely simple architectures it may be possible to prove that, at the global maximum likelihood estimates of the weights, <script type="math/tex">\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}</script>, but unfortunately, such simple networks are much less likely to contain the exact representation of <script type="math/tex">\mathcal{M}</script>. Therefore, one can only prove <script type="math/tex">\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}</script> in the limit of infinite data.
This is because, in the limit of infinite training data and infinite validation data, we can assume (but not know) that a network could be found (via optimising the hyperparameters over the space of all possible architectures, activation functions, initial conditions of the weights, etc.) which has the capability to exactly reproduce the model <script type="math/tex">\mathcal{M} : {\bf d}\to{\bf t}</script> by finding the true global maximum of the weights over the space of all possible weights in all possible architectures.</p>
<p>An interesting point to make, especially for regression to model parameters, is that one attempts to use the neural network to find a mapping from a many-to-one value space since the same <script type="math/tex">{\bf t}</script> could produce a very large number of different <script type="math/tex">{\bf d}</script>, i.e. the forward model is stochastic. It is an extremely difficult procedure to undo stochastic processes, which is why the neural network will likely never achieve the target function.</p>
<!--### Using MCDropout
MCDropout is a simple extension to traditionally trained neural networks where a probabilistic binary mask $$\boldsymbol{m}$$ is applied to every weight $$\boldsymbol{w}$$ of the network, $$\boldsymbol{w}\to\boldsymbol{mw}$$. The mask can take a value of 0 or 1 given a binomial distribution where a _keep_ value determines what proportion of weights are set to zero. Training is performed in the traditional way where the weights are _dropped_ randomly, which essentially samples some extremely small subset of the hyperparameter space $$\boldsymbol{\alpha}$$, i.e. the subset whose global maximum .
MCDropout is a technique which is often said to approximate Bayesian neural networks. However, we can see this cannot be true since the weights, $$\boldsymbol{w}$$, are fixed and only a very small prior space of $$\boldsymbol{\alpha}$$ is subsampled. In practice, the weights for each subnetwork in the dropped network will not be in even a local maximum of the likelihood for that subnetwork and, as such, not only are the $$\boldsymbol{\tau}$$ not equal to the true targets $${\bf t}$$ given data $${\bf d}$$ but it is likely that they are very far away. In particular, it is very common to obtain spurious modes of certainty for some set of subnetworks.-->
<h2 id="variational-inference-using-approximate-weight-priors">Variational inference using approximate weight priors</h2>
<p class="figure"><img src="/assets/posts/nn/VB.svg" alt="Variational inference network" />
<em>A neural network can be trained via variational inference where parameters of the network predict the parameters of a variational distribution from which the weights for the forward propagation are drawn.</em></p>
<p>All of the problems with the traditional picture arise due to degenerating the probability of the weights and hyperparameters <script type="math/tex">\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})\to\delta(\boldsymbol{w}-\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)</script>. We can recover variational inference by assuming the posterior distribution of the weights becomes an approximate variational distribution, <script type="math/tex">\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})</script>, which approximates the posterior of <script type="math/tex">\boldsymbol{w}</script> given a secondary set of network parameters which define the shape of the variational distribution, <script type="math/tex">\boldsymbol{v}</script>, a set of hyperparameters, <script type="math/tex">\boldsymbol{\alpha}</script>, and a set of training data and target pairs, <script type="math/tex">\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}</script>. The posterior predictive density for the targets <script type="math/tex">{\bf t}</script> is then written</p>
<script type="math/tex; mode=display">\mathcal{P}({\bf t}\vert {\bf d})=\int d\boldsymbol{w}d\boldsymbol{v}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})p(\boldsymbol{v},\boldsymbol{\alpha}).</script>
<p>In practice, the parameters controlling the shape of the variational distribution, <script type="math/tex">\boldsymbol{v}</script>, and the hyperparameters, <script type="math/tex">\boldsymbol{\alpha}</script>, are optimised iteratively using a training and validation set, as with the traditional training framework, and as such the posterior predictive density becomes</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{P}({\bf t}\vert {\bf d})&=\int d\boldsymbol{w}d\boldsymbol{v}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\\
&\phantom{=hello}\times\delta(\boldsymbol{v}-\boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)\\
&=\int d\boldsymbol{w}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w}, \boldsymbol{\alpha}^*)\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}^*, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}).
\end{align} %]]></script>
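<p>The final integral over the weights can be estimated by Monte Carlo: draw weights from the variational distribution and average the likelihood. The sketch below assumes a toy one-weight "network" <code>tau = w * d</code>, a Gaussian likelihood and a Gaussian variational distribution. These choices, the function names and the numbers are illustrative assumptions only, picked because the integral then has a closed form to compare against.</p>

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def predictive_density(t, d, v_mean, v_std, sigma=0.5, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(t|d) = int dw L(t|d,w) Q(w|v).

    Assumed toy setup: network output tau = w * d, Gaussian likelihood of
    width sigma, and Gaussian variational distribution Q(w|v) with
    v = (v_mean, v_std).
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(v_mean, v_std, size=n_samples)  # w ~ Q(w | v)
    return gaussian_pdf(t, w * d, sigma).mean()    # average of L(t | d, w)


estimate = predictive_density(t=1.0, d=2.0, v_mean=0.4, v_std=0.3)

# For this linear-Gaussian toy the integral is analytic:
# t ~ Normal(v_mean * d, sqrt(sigma^2 + (v_std * d)^2)).
exact = gaussian_pdf(1.0, 0.4 * 2.0, np.sqrt(0.5**2 + (0.3 * 2.0) ** 2))
```

<p>In this toy the width of the predictive density is set directly by the width of the variational distribution, so a poorly chosen variational family carries straight through to the predictions.</p>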
<p class="figure"><img src="/assets/posts/nn/vi_w.gif" alt="True vs. variational targets" />
<em>When the posterior distribution for the weights and hyperparameters of a neural network is approximated using a variational distribution, the posterior predictive density for the targets given some data has a form dictated mostly by the shape of the variational distribution. This shape is not necessarily correct since only simple distributions are usually used for the variational distribution and the distribution of weights can be extremely complex.</em></p>
<p>In principle, if <script type="math/tex">\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}^*, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})</script> well represents the true posterior of the weights and hyperparameters, <script type="math/tex">\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})</script>, then this can be a good approximation. However, this is very dependent on the distributions which <script type="math/tex">\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})</script> can represent. <script type="math/tex">\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})</script> is normally chosen to be Gaussian, or perhaps a mixture of Gaussians. As discussed already, the likelihood of obtaining any set of weights, <script type="math/tex">\boldsymbol{w}</script>, is actually extremely bumpy and degenerate and, as such, <script type="math/tex">\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})</script> must be chosen to be able to properly represent this. If <script type="math/tex">\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})</script> is poorly proposed then the posterior predictive density of the targets, <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})</script>, will be incorrect.</p>
<p class="figure"><img src="/assets/posts/nn/wrong_variational_w.svg" alt="Poor variational distribution" />
<em>The variational distribution often does not have enough complexity to fully model the intricate nature of the true posterior distribution of weights and hyperparameters. This can lead variational inference to be misleading.</em></p>
<h2 id="bayesian-neural-networks">Bayesian neural networks</h2>
<p class="figure"><img src="/assets/posts/nn/Bayes.svg" alt="Bayesian neural network" />
<em>A Bayesian neural network is similar to a traditional one, except that the distribution of the weights (and hyperparameters) of the network is characterised by the posterior for the weights and hyperparameters given a set of training data.</em></p>
<p>An effective Bayesian neural network can be built if we use the true posterior of <script type="math/tex">\boldsymbol{w}</script> and <script type="math/tex">\boldsymbol{\alpha}</script> given some training data rather than degenerating it to a Dirac <script type="math/tex">\delta</script>, that is, by keeping</p>
<script type="math/tex; mode=display">\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\propto\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i,\boldsymbol{w},\boldsymbol{\alpha})p(\boldsymbol{w},\boldsymbol{\alpha}).</script>
<p>With this, the predictive probability density of <script type="math/tex">{\bf t}</script> given <script type="math/tex">{\bf d}</script> becomes</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{P}({\bf t}\vert {\bf d}) =&~\int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\\
\propto&~\int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i,\boldsymbol{w},\boldsymbol{\alpha})p(\boldsymbol{w},\boldsymbol{\alpha}).
\end{align} %]]></script>
<p>Obviously the Bayesian neural network comes at a much higher computational cost than just finding the maximum likelihood estimate for the weights, but it does provide a more reasoned posterior predictive probability density, <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})</script>. Notice that the prior, <script type="math/tex">p(\boldsymbol{w},\boldsymbol{\alpha})</script>, still enters and so we need to make an informed decision on our belief for what the values of <script type="math/tex">\boldsymbol{w}</script> and <script type="math/tex">\boldsymbol{\alpha}</script> should be. However, for enough training data-target pairs (and enough time to sample through whatever chosen prior, <script type="math/tex">p(\boldsymbol{w},\boldsymbol{\alpha})</script>) the posterior <script type="math/tex">\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})</script> becomes informative enough to obtain useful posterior predictions for the targets.</p>
<p class="figure"><img src="/assets/posts/nn/dd.gif" alt="Characterising the posterior" />
<em>For small numbers of data points, the likelihood is poorly characterised and so can lead to biasing in the posterior predictive density. It is therefore important to have enough data to properly know the likelihood - it is not easy to determine how much this is.</em></p>
<p>In effect, to make use of Bayesian neural networks, one has to resort to sampling techniques, such as Markov chain Monte Carlo, to describe <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})</script>. Because of the (normally extremely large) number of weights, techniques such as Metropolis-Hastings cannot be considered. We proposed using a second-order geometrical adaptation of Hamiltonian Monte Carlo (QN-HMC) in Charnock et al. 2019 (<a href="/method/machine%20learning/npe.html">read more</a>). By using such a sampling technique, one could generate samples for the posterior predictive density, <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})</script>, whose distribution describes the probability of getting a target <script type="math/tex">{\bf t}</script> from data <script type="math/tex">{\bf d}</script> marginalised over all network parameters <script type="math/tex">\boldsymbol{w}</script> given a hyperparameter, <script type="math/tex">\boldsymbol{\alpha}=\boldsymbol{\alpha}^*</script><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. It is difficult to sample <script type="math/tex">\boldsymbol{\alpha}</script> when using the QN-HMC since gradients of the likelihood need to be computed and the likelihood in the <script type="math/tex">\boldsymbol{\alpha}</script> direction is often discrete. How to properly sample from <script type="math/tex">\boldsymbol{\alpha}</script> is still up for debate.</p>
<p>So now let's say we have enough computational power to build a true Bayesian neural network. Are we guaranteed to obtain a correct posterior predictive density?</p>
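<p>As a rough illustration of gradient-based sampling of network weights, here is a minimal sketch of <em>plain</em> Hamiltonian Monte Carlo with a leapfrog integrator. It is not the QN-HMC of Charnock et al. 2019. The two-weight linear "network", the unit Gaussian prior, the noise level and the step sizes are all assumptions made so that the example is small and its posterior is known analytically.</p>

```python
import numpy as np


def hmc(log_prob_and_grad, w0, n_samples=3000, step=0.02, n_leapfrog=10, seed=0):
    """Plain HMC with unit mass matrix; returns an array of weight samples."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    logp, grad = log_prob_and_grad(w)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(w.shape)        # fresh momentum
        h_old = -logp + 0.5 * p @ p             # initial Hamiltonian
        w_new, grad_new = w.copy(), grad.copy()
        p = p + 0.5 * step * grad_new           # initial half step in momentum
        for i in range(n_leapfrog):
            w_new = w_new + step * p            # full step in position
            logp_new, grad_new = log_prob_and_grad(w_new)
            if i < n_leapfrog - 1:
                p = p + step * grad_new         # full step in momentum
        p = p + 0.5 * step * grad_new           # final half step in momentum
        h_new = -logp_new + 0.5 * p @ p
        if np.log(rng.uniform()) < h_old - h_new:   # Metropolis accept/reject
            w, logp, grad = w_new, logp_new, grad_new
        samples.append(w.copy())
    return np.array(samples)


# Assumed toy problem: t = d @ w + Gaussian noise, with a unit Gaussian prior on w.
rng = np.random.default_rng(42)
sigma = 0.5
d = rng.standard_normal((50, 2))
t = d @ np.array([1.0, -0.5]) + sigma * rng.standard_normal(50)


def log_posterior_and_grad(w):
    """Unnormalised log posterior of the weights and its gradient."""
    resid = t - d @ w
    logp = -0.5 * resid @ resid / sigma**2 - 0.5 * w @ w
    grad = d.T @ resid / sigma**2 - w
    return logp, grad


samples = hmc(log_posterior_and_grad, w0=np.zeros(2))
```

<p>Each stored row is one draw of the weights; pushing new data through the network for every draw would give samples of the posterior predictive density. For a real network the number of weights is far larger and the posterior far less well behaved than in this toy.</p>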
<h1 id="source-of-the-problem">Source of the problem</h1>
<h2 id="training-on-data">Training on data</h2>
<p>Notice how all of the techniques mentioned above are dependent on a set of training data and target pairs, <script type="math/tex">\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}</script> (and possibly validation data and targets, <script type="math/tex">\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}</script>). It is in the posterior (or variational distribution) for the weights that the training data enters</p>
<script type="math/tex; mode=display">\mathcal{P}({\bf t}\vert {\bf d}) = \int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})</script>
<p>and, as already explained, the last term in the integral contains the informative part about the posterior predictive density. As such, any biasing due to <script type="math/tex">\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}</script> greatly affects <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})</script>.</p>
<p>When depending on a training set, <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf d})</script> is always unknowably biased until the limit of infinite data is reached. So, no method mentioned so far provides us with the correct probability of obtaining the target!</p>
<p>For networks, such as emulators (or generative networks as they are commonly called), where the probability distribution of generating targets, <script type="math/tex">\mathcal{P}({\bf t}\vert {\bf z})</script>, with generated data <script type="math/tex">{\bf t}</script> and a latent distribution <script type="math/tex">{\bf z}</script>, should approximate the distribution of true data <script type="math/tex">\mathcal{P}({\bf d})</script>, then the above argument means that we cannot find <script type="math/tex">\mathcal{P}({\bf d})</script> by training a neural network without infinite training data<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<h2 id="incorrect-models">Incorrect models</h2>
<p>One interesting use for neural networks is the predicting of physical model parameters, <script type="math/tex">\boldsymbol{\theta}</script>, for a model <script type="math/tex">\mathcal{M} : \boldsymbol{\theta}\to{\bf d}</script>. In this case, even for infinite data, we cannot obtain true posterior distributions for the parameters. Take a network which maps <script type="math/tex">\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to\hat{\boldsymbol{\theta}}</script>, where <script type="math/tex">\hat{\boldsymbol{\theta}}</script> are estimates of the model parameters, <script type="math/tex">\boldsymbol{\theta}</script>, which generate the data. Even if there is infinite training data, <script type="math/tex">\{ {\bf d}^\textrm{train}_i, \boldsymbol{\theta}^\textrm{train}_i\vert i\in[1,\infty]\}</script>, if the original model is incorrect, then the neural network will be conditioned on the wrong map from data, <script type="math/tex">{\bf d}</script>, to parameters, <script type="math/tex">\boldsymbol{\theta}</script>, and so any observed data, <script type="math/tex">{\bf d}^\textrm{obs}</script>, passed through the network will be passed through the incorrect approximation of the model and provide a poor estimate of the incorrect model parameter values. This means that true posteriors on the model parameters can only be obtained with the exact model which generates the <em>observed</em> data <strong>and</strong> an infinite amount of training data from that model, to be able to correctly provide parameter estimates.</p>
<p><strong>This is not realistic!</strong></p>
<h1 id="solutions">Solutions</h1>
<p>We have so far built a description of how to obtain the probability of targets, <script type="math/tex">{\bf t}</script>, given data, <script type="math/tex">{\bf d}</script>, passed through a neural network… and, unfortunately, we have learned that it cannot be obtained.</p>
<p>There is still one problem where we can use neural networks safely despite all of the above. This is to do model parameter inference.</p>
<p>So far we have only considered a neural network as an approximation to a model <script type="math/tex">\mathcal{M} : {\bf d}\to{\bf t}</script>. Now let's say we have a physical model, <script type="math/tex">\mathcal{Z}(\boldsymbol{\iota}) : \boldsymbol{\theta}\to{\bf d}</script>, which generates the data, <script type="math/tex">{\bf d}</script>, from a set of model parameters, <script type="math/tex">\boldsymbol{\theta}</script>, dependent on a set of initial conditions <script type="math/tex">\boldsymbol{\iota}</script>. We can then safely use a neural network, <script type="math/tex">\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to\boldsymbol{\tau}</script>, to infer the model parameters of some observed data, <script type="math/tex">{\bf d}^\textrm{obs}</script>. Note that we cannot use a network to predict model parameters directly <script type="math/tex">(\mathbb{NN} : {\bf d}\to\boldsymbol{\theta})</script> due to all of the arguments above. Instead we need to set up a statistical inference framework which encompasses the neural network.</p>
<p>Charnock et al. 2019 and Charnock, Lavaux and Wandelt 2018 show two different methods to perform physical model parameter inference using neural networks, in a well justified way.</p>
<h2 id="writing-down-the-likelihood">Writing down the likelihood</h2>
<p>I should mention an extremely rare case where the model <script type="math/tex">\mathcal{M} : {\bf d}\to{\bf t}</script>, is simple enough to be parameterised by an extremely simple network with very few parameters, which are non-degenerate and well behaved and for which the hyperparameters, <script type="math/tex">\boldsymbol{\alpha}</script>, can be well designed to avoid needing to sample over this space.</p>
<p>For this case, the likelihood could be written down, and therefore fully established and sampled from, and biases from training data-target pairs could be totally avoided.</p>
<p><strong>It is pretty unlikely that such a network could be found without considering physical principles.</strong></p>
<h2 id="model-extension">Model extension</h2>
<p>In Charnock et al. 2019, the connection between the observed data and the output of the physical model is not known, i.e. the data from a model given initial conditions, <script type="math/tex">\boldsymbol{\iota}</script>, is <script type="math/tex">\mathcal{Z}(\boldsymbol{\iota}) : \boldsymbol{\theta}\to{\bf d}</script>. This <script type="math/tex">{\bf d}</script> does not look like <script type="math/tex">{\bf d}^\textrm{obs}</script>, although we know that we want the posterior distribution <script type="math/tex">\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs})</script>. In Charnock et al. 2019, we know we can observe the universe and model the underlying dark matter of the universe, but the complex astrophysics which maps the dark matter of the universe to the observable tracers is unknown. We do, however, know some physical properties of this mapping. In this case, we build a neural network with the physically motivated symmetries to take the output of the physical model to the distribution which is as close to the observed data as possible (<a href="/method/machine%20learning/npe.html">read more</a>). In the language used previously, thanks to the problems we deal with in cosmology and astrophysics we can actually choose the hyperparameters of a neural network, <script type="math/tex">\boldsymbol{\alpha}</script>, in a reasoned manner. These physically motivated neural networks therefore massively reduce the volume of the <script type="math/tex">\boldsymbol{\alpha}</script> domain. With a careful choice of <script type="math/tex">\boldsymbol{\alpha}</script> we can also build a network whose priors on the network parameters, <script type="math/tex">\boldsymbol{w}</script>, can be (at least reasonably) well informed.</p>
<p>We can write the parameter inference as</p>
<script type="math/tex; mode=display">\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs}) \propto \int d\boldsymbol{\iota}d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf d}^\textrm{obs}\vert \boldsymbol{\iota},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{P}(\boldsymbol{\iota}\vert \boldsymbol{\theta})p(\boldsymbol{w},\boldsymbol{\alpha})</script>
<p>That is, the posterior distribution for the model parameters given some observed data is proportional to the marginal distribution of how likely the observed data is given the initial conditions of the model, <script type="math/tex">\boldsymbol{\iota}</script>, which depend on the model parameters, <script type="math/tex">\boldsymbol{\theta}</script>, which generate the initial conditions and evolve the model forward to the input of the neural network with network parameters and hyperparameters <script type="math/tex">\boldsymbol{w}</script> and <script type="math/tex">\boldsymbol{\alpha}</script>.</p>
<p>In the case presented here, there is no training data for the network; instead, the data needed to obtain the posterior is part of the statistical framework. Therefore, the network provides non-agnostic posterior parameter inference because we do not learn the posterior distribution, <script type="math/tex">\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})</script>, using training data. In essence, this defines the procedure to perform zero-shot training.</p>
<p>It should be noted that this procedure is difficult. It necessitates a sampling scheme for the neural network and the physical model. In Charnock et al. 2019, we use an advanced Hamiltonian Monte Carlo sampling technique on a model where we have calculated the adjoint gradient and the neural network whose architecture is well informed but fixed.</p>
<h2 id="likelihood-free-inference">Likelihood-free inference</h2>
<p>The model extension method works well, but still depends on knowing the form of the likelihood of the observed data. In practice, this could be extremely difficult. It also depends on a choice of hyperparameters (or at least a well defined prior based on physical principles). In Charnock, Lavaux and Wandelt 2018, we showed another model extension method which allows us to obtain optimal model parameter inference using neural networks by (semi)-classically training a neural network, <script type="math/tex">\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to{\bf t}</script>, where the target distribution is the set of Gaussianly distributed summaries which maximise the Fisher information matrix. Although the network in this work is, in some way, optimal - the main point of this paper is that parameter inference can be done using likelihood-free inference by extending the physical model <script type="math/tex">\mathcal{M} : \boldsymbol{\theta}\to{\bf d}</script> to <script type="math/tex">\mathcal{N} :\boldsymbol{\theta}\to{\bf t}</script> where <script type="math/tex">{\bf t}</script> is <em>any</em> set of summaries.</p>
<p>Likelihood-free inference is a framework where, via generating data using the physical model, <script type="math/tex">\mathcal{M} : \boldsymbol{\theta}\to{\bf d}</script>, the joint probability of data and parameters, <script type="math/tex">\mathcal{P}({\bf d},\boldsymbol{\theta})</script>, can be characterised. Once this space is well defined, a slice through the distribution at any <script type="math/tex">{\bf d}^\textrm{obs}</script> gives the posterior distribution <script type="math/tex">\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs})</script> - likewise the slice through the joint distribution at any parameter <script type="math/tex">\boldsymbol{\theta}^*</script> gives the likelihood distribution <script type="math/tex">\mathcal{L}({\bf d}\vert \boldsymbol{\theta^*})</script>. This works for any system where we can model the data!</p>
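<p>Written out, "slicing" the joint distribution is just conditioning: each slice is the joint density evaluated at the fixed value and renormalised over the remaining variable,</p>
<script type="math/tex; mode=display">\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs})=\frac{\mathcal{P}({\bf d}^\textrm{obs},\boldsymbol{\theta})}{\int d\boldsymbol{\theta}~\mathcal{P}({\bf d}^\textrm{obs},\boldsymbol{\theta})},\qquad\mathcal{L}({\bf d}\vert \boldsymbol{\theta}^*)=\frac{\mathcal{P}({\bf d},\boldsymbol{\theta}^*)}{\int d{\bf d}~\mathcal{P}({\bf d},\boldsymbol{\theta}^*)}.</script>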
<p>The neural networks become essential as functions which perform data compression (although, it should be noted that any summary of the data will work). Since, in general, the dimensionality of the data is much larger than the number of model parameters, a neural network can be trained to compress the data in some way, <script type="math/tex">\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to{\bf t}</script>. We can train this in any way to give us some absolute summaries, <script type="math/tex">{\bf t}</script>, where we, essentially, do not care what the summaries are. Note that <script type="math/tex">\boldsymbol{w}</script> and <script type="math/tex">\boldsymbol{\alpha}</script> do not need to be maximum likelihood estimates. By pushing all the generated data from the physical model through this <em>fixed</em> network we can characterise the probability distribution of parameters and compressed summaries, <script type="math/tex">\mathcal{P}({\bf t},\boldsymbol{\theta})</script>, which we can slice at any <script type="math/tex">\boldsymbol{\theta}^*</script> to give the likelihood of obtaining any summaries, <script type="math/tex">\mathcal{L}({\bf t}\vert \boldsymbol{\theta}^*)</script>, or (more interestingly) slice at any observed data pushed through the network, <script type="math/tex">\mathbb{NN}(\boldsymbol{w}^*, \boldsymbol{\alpha}^*) : {\bf d}^\textrm{obs}\to{\bf t}^\textrm{obs}</script>, to get the posterior,</p>
<script type="math/tex; mode=display">\mathcal{P}(\boldsymbol{\theta}\vert {\bf t}^\textrm{obs})=\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs},\boldsymbol{w}^*,\boldsymbol{\alpha}^*).</script>
<p>This posterior, whilst conditional on the network parameters and hyperparameters, is unbiased in the sense that when the neural network, <script type="math/tex">\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to{\bf t}</script> is not optimal, the posterior can only become inflated (and not incorrectly biased).</p>
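<p>A minimal sketch of this idea uses rejection sampling (approximate Bayesian computation) as a simple stand-in for whichever likelihood-free method is actually used. Everything here is an assumption for illustration: the simulator (parameter plus unit Gaussian noise), the uniform prior, the small untrained network with fixed random weights, and the rule of keeping the closest 1% of simulations.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_data, n_sims = 20, 20_000


def simulate(theta, rng):
    """Hypothetical physical model: each datum is theta plus unit Gaussian noise."""
    theta = np.atleast_1d(theta)
    return theta[:, None] + rng.standard_normal((theta.size, n_data))


# A *fixed*, untrained network with random weights, used only as a compressor.
W1 = rng.standard_normal((n_data, 8)) / np.sqrt(n_data)
w2 = rng.standard_normal(8)


def network_summary(d):
    """Compress each data vector to one number with the fixed network."""
    return np.tanh(d @ W1) @ w2


def mean_summary(d):
    """Reference summary: the sample mean, which is sufficient for this toy model."""
    return d.mean(axis=1)


def abc_posterior(summary, d_obs, thetas, sims, keep=0.01):
    """Keep the parameters whose simulated summaries lie closest to the observed one."""
    t_obs = summary(d_obs)[0]
    dist = np.abs(summary(sims) - t_obs)
    return thetas[dist <= np.quantile(dist, keep)]


theta_true = 2.0
d_obs = simulate(theta_true, rng)            # stands in for the observed data
thetas = rng.uniform(0.0, 5.0, n_sims)       # draws from the prior
sims = simulate(thetas, rng)                 # one simulation per prior draw

post_net = abc_posterior(network_summary, d_obs, thetas, sims)
post_mean = abc_posterior(mean_summary, d_obs, thetas, sims)
```

<p>Both <code>post_net</code> and <code>post_mean</code> are sets of samples from an approximate posterior conditioned on the chosen summary. One would typically expect the untrained network to give the broader of the two, since it discards more information than the sufficient statistic.</p>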
<p>The information maximising neural network, presented in Charnock, Lavaux and Wandelt 2018, provides the optimal summaries<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> for the likelihood-free inference - but any neural network can be used in this inference framework. In particular, any neural network which looks like it provides good estimates of the targets for a model <script type="math/tex">\mathcal{M} : {\bf d}\to{\bf t}</script> (as discussed throughout), will likely have extremely informative summaries, even if their outputs are vanishingly unlikely to be equal to the true target values (see traditionally training neural networks)!</p>
<h1 id="conclusions">Conclusions</h1>
<p>Presented here is a thorough statistical diagnostic of neural networks. I have shown that, by design, neural networks cannot provide realistic posterior predictive densities for arbitrary targets. This essentially makes all neural networks unusable in science.</p>
<p>However, I have presented how my previous works can undermine this previous statement for model parameter inference. Since either a statistical interpretation or a fully trained neural network can be appended to a physical model, we can build a statistical framework around both the model and the neural network to allow us to do rigorous, scientific analysis of model parameters, which is one of the essential tasks in science today.</p>
<hr />
<p>Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, Supranta Sarma Boruah, Jens Jasche, Michael J. Hudson, <b>2019</b>, submitted to MNRAS, arXiv:1909.06379</p>
<p>Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, <b>2018</b>, Physical Review D 97, 083004, arXiv:1802.03537</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>It should be noted that the work in Charnock et al. 2019 was tackling a larger problem and asking a different question than the one stated here for Bayesian neural networks. Bayesian neural networks are a subset of the techniques from that paper, although closely linked. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>We can hope that the generated target distribution gets close to the true data distribution and decide we are not bothered about statistics anymore. Maybe a dangerous situation for science‽ <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Optimal in the sense that the Fisher information is maximised. This has some assumptions such as the unimodality (but not necessarily Gaussianity) of the posterior, and the fact that the neural network being maximised is capable of finding a function which Gaussianises the data. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 07 Dec 2019 00:00:00 +0100
https://www.aquila-consortium.org/method/machine%20learning/nn.html
method, machine learning
Neural physical engines for inferring the halo mass distribution function
<p>To be able to make the most of the wealth of cosmological information available via observations of the large scale structure of the universe it is vital to have a strong model of how observable objects such as galaxies trace the underlying dark matter.
In this work we used a neural bias model: a physically motivated neural network from which we can infer the halo mass distribution function.
This function describes the abundance of halos with a certain mass given a dark matter density environment, where the halos are compact dark matter objects in which galaxies are hosted.
As such, the neural bias model gives us a strong, but agnostic, bias model mapping the dark matter density field to (tracers of) the observable universe.
Such a neural bias model can be included in the BORG inference scheme such that the initial conditions of the dark matter density and the parameters of the neural bias model are sampled using Hamiltonian Monte Carlo.</p>
<h1 id="halo-mass-distribution-function">Halo mass distribution function</h1>
<p>The halo mass distribution function describes the number of dark matter halos at a certain mass given a dark matter density environment.
It has been well studied in the past, and as such we know the approximate form of the function, which is described by the Press-Schechter formalism: a power law at small masses with an exponential cut-off at high masses.
There are also less well understood elements, including how the non-local density environment affects the abundance of halos and the form of the stochasticity with which halos are drawn from the halo mass distribution function.
This stochasticity describes how one obtains the actual number of observed halos of a certain mass given that the halo mass distribution function only describes the probability of observing such a halo.
The sampling of halos from the halo mass distribution function is normally assumed to be Poissonian, but this is known to be insufficient.
Whilst we consider a Poissonian likelihood in this work, it should be noted that it is Poisson for a field of summaries provided by a neural physical engine and so includes information from the local surrounding region.</p>
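<p>The two ingredients described above can be sketched in a few lines of Python: a mass function that is a power law with an exponential cut-off, and a Poisson log-likelihood for halo counts. The parameter names and default values (<code>amplitude</code>, <code>slope</code>, <code>cutoff_mass</code>) are hypothetical placeholders, and this is not the parameterisation used in the paper, where the rates come from the neural physical engine.</p>

```python
from math import lgamma

import numpy as np


def halo_mass_function(mass, amplitude=1.0, slope=1.0, cutoff_mass=1e14):
    """Schematic abundance: power law at low mass, exponential cut-off at high mass."""
    x = np.asarray(mass, dtype=float) / cutoff_mass
    return amplitude * x ** (-slope) * np.exp(-x)


def poisson_log_likelihood(counts, rates):
    """Sum over cells of log Poisson(counts | rates)."""
    counts = np.asarray(counts, dtype=float)
    rates = np.asarray(rates, dtype=float)
    # log(counts!) evaluated via the log-gamma function
    log_factorial = np.array([lgamma(c + 1.0) for c in counts.ravel()])
    log_factorial = log_factorial.reshape(counts.shape)
    return np.sum(counts * np.log(rates) - rates - log_factorial)


masses = np.array([1e12, 1e13, 1e14, 1e15])
expected = halo_mass_function(masses)        # expected abundance in each mass bin
observed = np.array([95, 10, 0, 0])          # made-up halo counts, for illustration
log_like = poisson_log_likelihood(observed, expected)
```

<p>In a full analysis the expected counts would depend on the local dark matter density as well as on mass, and the log-likelihood would be summed over every voxel of the grid.</p>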
<h1 id="zero-shot-training-bayesian-neural-networks">Zero-shot training, Bayesian neural networks</h1>
<p>The neural network used in this work is not pre-trained and is conditioned on the observed data only, in this case a halo catalogue obtained from a high resolution dark matter simulation.
Zero-shot training describes a method of fitting a function without any training data.
Several components are necessary to be able to achieve such a fitting of the neural bias model introduced here.
These are: basing the design of the architecture of the network on physical principles; using appropriate functions to model the form of the halo mass distribution function; and finding a stable sampling procedure to obtain parameter samples from the posterior.</p>
<h2 id="neural-physical-engines">Neural physical engines</h2>
<p>Neural physical engines are simply neural networks that are built using physical principles.
For example, with a physical model of how some data is distributed according to the parameters of a model, one builds a neural network with the symmetries of such a model built into its architecture.
This is particularly useful for several reasons.
Primarily, such a neural physical engine is strongly protected from overfitting.
Overfitting is prevented because only information relevant to the problem at hand can be fitted, and the network is insensitive to spurious features of the data, such as noise.
An added benefit to these networks is the massive reduction in the number of parameters necessary to fit the required function.
This improves the computational efficiency of the algorithm, decreases training times and increases the interpretability of the network.</p>
<p class="figure wide"><img src="/assets/posts/npe/NPE.svg" alt="Neural physical engine" />
<em>The neural physical engine is a physically motivated neural network which maps a dark matter density distribution, evolved by Lagrangian perturbation theory, to a set of summaries which are informative about the abundance of halos of a certain mass on the grid.</em></p>
<p>When building the neural bias model we construct a neural physical engine whose input is a small patch of the gridded dark matter density field, evolved from the initial conditions to today using Lagrangian perturbation theory. Its output is a single informative summary per voxel about the abundance of halos with a certain mass at that patch of the dark matter density field.
We know that the halo mass distribution function is only sensitive to local information and, at the resolution we are working at, depends mostly on the amplitude of the dark matter density field rather than on the exact position of structures such as filaments or nodes.
We also know that the data is distributed evenly across the volume, i.e. there is translational and rotational invariance in the dark matter density field.
This encourages us to use parameterised three-dimensional convolutional kernels with an extent which is only as large as the relevant scales and where the parameters are shared within the kernels according to a radial symmetry.</p>
<p class="figure wide"><img src="/assets/posts/npe/kernels.svg" alt="Multipole expansion of convolutional kernel" />
<em>The convolutional kernels used in neural networks are discrete and gridded, with each element of the array being an independent trainable parameter.
We introduce a method by which we can expand the kernels in terms of multipoles by associating weights at equal distances (and at given rotational angles) from the centre of the kernel.
Take for example a 3x3x3 convolutional kernel.
Normally this would have 27 free parameters.
By looking at the radially symmetric kernel, i.e. ℓ=0, each corner has an associated weight, as does each edge and each face and there is a single weight for the central element, equating to a total of 4 free parameters.
Then in the case of the dipolar kernel, i.e. ℓ=1, there are three independent kernels each with 3 parameters, making a total of 9.
For ℓ=2, there are now 5 independent kernels with 2 parameters each and including ℓ=3 saturates the freedom of the convolutional kernel and so no further multipoles are needed to fully parameterise the general kernel.
We can use this expansion to either reduce the number of parameters necessary by truncating in multipoles, or we can learn more about the informational content of the data in terms of expansion in multipoles.
In the second case, once trained, one can look at the response of the data in independent multipole paths: the larger the response, the more informative that multipole is about the role of the data in the neural network.
The code for producing the multipole kernels can be found at <a href="https://github.com/tomcharnock/multipole_kernels">github:multipole_kernels</a>.</em></p>
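<p>The parameter counting for the radially symmetric case can be checked with a few lines of NumPy. This is a minimal sketch written for this post, not the code from the repository linked above: the function name <code>radial_kernel</code> and the example weights are hypothetical. It shares one weight between all elements at the same distance from the centre of the kernel.</p>

```python
import numpy as np

def radial_kernel(weights, size=3):
    """Build a radially symmetric (l=0) cubic kernel from shared weights.

    weights: one value per distinct squared distance from the kernel centre.
    """
    half = size // 2
    # Integer offsets of every kernel element from the central element.
    offsets = np.arange(size) - half
    dx, dy, dz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    r2 = (dx**2 + dy**2 + dz**2).ravel()
    # Each distinct squared radius (shell) gets its own shared weight.
    shells, index = np.unique(r2, return_inverse=True)
    if len(weights) != len(shells):
        raise ValueError(f"expected {len(shells)} weights, got {len(weights)}")
    kernel = np.asarray(weights, dtype=float)[index].reshape(size, size, size)
    return kernel, shells

# Four free parameters: centre, faces, edges, corners (illustrative values).
kernel, shells = radial_kernel([1.0, 0.5, 0.25, 0.125])
```

<p>For a 3x3x3 kernel the distinct squared distances are 0, 1, 2 and 3, which correspond to the centre, the 6 faces, the 12 edges and the 8 corners, so the 27 elements are described by 4 weights.</p>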
<p class="figure wide"><img src="/assets/posts/npe/receptive_field.svg" alt="Importance of receptive field" />
<em>The size of the convolutional kernel used is extremely important for a neural physical engine.
The size of the kernel is known as the receptive field, and dictates the size of the correlations which can be learned by the neural network. The receptive field should be chosen based on the data. If it is too small then it is impossible to learn about relevant features in the data, and the network will tend to average out even the small scale features since it cannot distinguish the large scale modes. Likewise, if the receptive field is too large then the kernel will be massively overparameterised, which can lead to overfitting and the fitting of spurious large scale features of the data. Since these large scale features are less common, they are less likely to be averaged out during training. This leads to a network which is difficult to train and has a much larger computational cost.
It should be noted that stacking convolutions leads to a larger receptive field throughout the network, but does not protect one from the above problems. The kernel size should be chosen carefully at each layer to make the most of the distribution of information at each layer independently (this can be very tricky to do).</em></p>
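<p>To make the effect of stacking concrete, the following sketch computes the receptive field of a stack of convolutions. It assumes stride-1, undilated convolutions, which is our assumption and is not stated in the text above; the helper name <code>receptive_field</code> is hypothetical.</p>

```python
def receptive_field(kernel_sizes):
    """Receptive field, in voxels per side, of stacked stride-1, undilated convolutions."""
    field = 1
    for size in kernel_sizes:
        # Each layer extends the field by (kernel size - 1) voxels.
        field += size - 1
    return field

single_layer = receptive_field([3])
three_layers = receptive_field([3, 3, 3])
```

<p>Under these assumptions three stacked 3x3x3 kernels see a 7x7x7 patch, the same extent as a single 7x7x7 kernel but with far fewer parameters.</p>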
<h2 id="neural-density-estimators">Neural density estimators</h2>
<p>Since we wish to model the halo mass distribution function we need to consider an architecture whose output is a function (or at least an evaluation of the function).
To do so we use a modified mixture density network which is a type of neural density estimator.
Neural density estimators are neural networks whose outputs are samples from a fitted probability distribution function.
For the halo mass distribution function we use a mixture of two Gaussian distributions where we allow the predicted amplitudes to be free positive parameters but sort the predicted means by magnitude.
This breaks the degeneracy between the two Gaussians and allows us to have a smooth function whose amplitude can accurately approximate the abundance of halos.</p>
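<p>The constraints described above can be sketched as follows. This is an illustration under our own assumptions rather than the network used in the paper: the softplus mapping used to keep amplitudes and widths positive, the layout of the six raw outputs, and the function names are all hypothetical.</p>

```python
import numpy as np

def softplus(x):
    # Smooth map from any real number to a positive one.
    return np.logaddexp(0.0, x)

def mixture_parameters(raw):
    """Map six unconstrained outputs to (amplitudes, means, widths) of two Gaussians."""
    raw = np.asarray(raw, dtype=float)
    amplitudes = softplus(raw[0:2])   # free positive amplitudes, not normalised
    means = np.sort(raw[2:4])         # ordered means break the label degeneracy
    widths = softplus(raw[4:6])       # positive standard deviations
    return amplitudes, means, widths

def halo_mass_function(log_mass, amplitudes, means, widths):
    """Evaluate the sum of two scaled Gaussians at the given log masses."""
    log_mass = np.atleast_1d(log_mass)[:, None]
    gauss = np.exp(-0.5 * ((log_mass - means) / widths) ** 2) / (widths * np.sqrt(2 * np.pi))
    return (amplitudes * gauss).sum(axis=1)

# Illustrative raw outputs; the means are deliberately given out of order.
a, m, s = mixture_parameters([0.3, -1.0, 13.5, 12.0, 0.0, 0.5])
```

<p>Because the amplitudes are not forced to sum to 1, the integral of the function over mass equals the sum of the amplitudes, which is what lets it represent an abundance rather than a probability.</p>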
<p class="figure wide"><img src="/assets/posts/npe/MDN.svg" alt="Mixture density network" />
<em>A mixture density network is a neural network which maps an input to a set of parameters for a collection of probability distributions. For example, one can predict the means, μ, standard deviations, σ, and amplitudes, α, of several Gaussian distributions and sum these Gaussians together. Provided that the amplitudes sum to 1, the mixture density will remain correctly normalised to be interpreted as a probability distribution. The mixture density network can then be trained by evaluating the value of the distribution at the labels for the input data and minimising the negative logarithm of the distribution.</em></p>
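<p>The training objective in the caption can be written as a short function. This is a sketch of a generic, normalised mixture density network loss, assuming the network outputs unconstrained logits and log standard deviations; those parameterisations and the function name are our choices for illustration.</p>

```python
import numpy as np

def mdn_negative_log_likelihood(logits, means, log_sigmas, labels):
    """Mean negative log-density of labels under per-sample Gaussian mixtures.

    logits, means, log_sigmas: arrays of shape (n_samples, n_components)
    labels: array of shape (n_samples,)
    """
    # Softmax in log space: amplitudes are positive and sum to 1.
    log_alpha = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    z = (labels[:, None] - means) / np.exp(log_sigmas)
    log_gauss = -0.5 * z**2 - log_sigmas - 0.5 * np.log(2 * np.pi)
    # Log-sum-exp over components gives the log of the mixture density.
    log_density = np.logaddexp.reduce(log_alpha + log_gauss, axis=1)
    return -log_density.mean()

# Two identical unit Gaussians with equal weights, evaluated at their mean.
loss = mdn_negative_log_likelihood(
    np.zeros((1, 2)), np.zeros((1, 2)), np.zeros((1, 2)), np.array([0.0])
)
```

<p>Working in log space avoids numerical underflow when a label lies far from every component.</p>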
<h2 id="likelihood-mathcallboldsymbolthetaboldsymboldelta_textsflpt">Likelihood, <script type="math/tex">\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT})</script></h2>
<p>To fit the halo mass distribution to the halo catalogue used in this work we consider a Poisson likelihood.
If our evolved dark matter density field, <script type="math/tex">\boldsymbol{\delta}_\textsf{LPT}</script>, is passed through the neural physical engine, with parameters <script type="math/tex">\boldsymbol{\theta}_\textsf{NPE}</script>, to get a field of summaries, <script type="math/tex">\boldsymbol{\psi}_\textsf{NPE} = \boldsymbol{\psi}_\textsf{NPE}(\boldsymbol{\delta}_\textsf{LPT}, \boldsymbol{\theta}_\textsf{NPE})</script>, our halo mass distribution function is given by</p>
<script type="math/tex; mode=display">n(M|\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_\textsf{MDN})= \sum_{i=1,2} \alpha(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)\mathcal{N}(M| \mu(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i),\sigma(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)),</script>
<p>where <script type="math/tex">\mathcal{N}(M|\mu, \sigma)</script> is the value of a Gaussian with mean <script type="math/tex">\mu</script> and standard deviation <script type="math/tex">\sigma</script> evaluated at halo mass <script type="math/tex">\textsf{log}(M)</script>.
The Poisson likelihood can be written as two terms.
The first term evaluates the neural halo mass distribution function for every halo in the catalogue, where the density environment is obtained from the patch of <script type="math/tex">\boldsymbol{\delta}_\textsf{LPT}</script> around each voxel index corresponding to each halo.
This term therefore fits the abundance scale due to the catalogue.
The second term is the integral over halo mass of the whole function for the entire evolved density field and therefore fits the shape of the function.</p>
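<p>Written out, these two terms take the standard form of the log-likelihood of an inhomogeneous Poisson process, up to constants. The notation below is ours and the conventions in the paper may differ: <script type="math/tex">h</script> runs over the halos in the catalogue, <script type="math/tex">x</script> runs over the voxels of the grid, <script type="math/tex">x_h</script> is the voxel containing halo <script type="math/tex">h</script>, and <script type="math/tex">M_\textsf{th}</script> is the mass threshold of the catalogue.</p>
<script type="math/tex; mode=display">\ln\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT}) = \sum_{h} \ln n(M_h|\boldsymbol{\psi}_{\textsf{NPE},x_h}, \boldsymbol{\theta}_\textsf{MDN}) - \sum_{x} \int_{M_\textsf{th}}^{\infty} n(M|\boldsymbol{\psi}_{\textsf{NPE},x}, \boldsymbol{\theta}_\textsf{MDN})\,\textsf{d}M</script>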
<p>Note that by using this likelihood we never have to explicitly make a stochastic sampling of the halos to compare to the catalogue, although we could use the fitted halo mass distribution function to generate halo catalogues by using the value of the evaluated neural bias model as the rate parameter for Poisson sampling.</p>
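<p>Generating a catalogue in this way amounts to one Poisson draw per voxel and mass bin. The sketch below assumes that the expected number of halos per voxel has already been obtained from the fitted function; the grid size and the constant rate of 0.5 are purely illustrative values, not numbers from the paper.</p>

```python
import numpy as np

def sample_halo_counts(rate, seed=0):
    """Draw integer halo counts from Poisson distributions with the given rates."""
    rng = np.random.default_rng(seed)
    return rng.poisson(rate)

# Hypothetical expected counts on a 4x4x4 grid for a single mass bin.
expected = np.full((4, 4, 4), 0.5)
counts = sample_halo_counts(expected)
```

<p>Each call with a different seed gives a different stochastic realisation of the halo field for the same underlying rate.</p>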
<p>We will also include a Gaussian prior, <script type="math/tex">\pi(\boldsymbol{\theta})</script>, on all the parameters of the neural bias model.
We ensure that these weights and biases are centred on zero by rescaling them using prior knowledge of the amplitude of the abundance measured from the halo catalogue and the halo mass threshold.
Since the parameters of the neural bias model are centred on zero, we just need to choose a width for the Gaussian prior which is large enough to allow for parameter exploration, but tight enough to make sampling the parameters feasible.</p>
<h2 id="hmclet">HMCLET</h2>
<p>To be able to sample the weights of the neural bias model we use a modified Hamiltonian Monte Carlo.
Hamiltonian Monte Carlo is a way of efficiently drawing samples from very high-dimensional probability distributions.
One starts with an initial set of neural bias model parameters, <script type="math/tex">\boldsymbol{\theta}_0</script>, and proposes a new set, <script type="math/tex">\boldsymbol{\theta}^*</script>, given a momentum, <script type="math/tex">{\bf p}</script>, drawn from a proposal distribution, <script type="math/tex">{\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})</script>.
M is a mass matrix which describes the time scale along the parameter direction and correlation between the parameters.
One then solves Hamilton’s equations, <script type="math/tex">d\boldsymbol{\theta}/dt = {\bf M}^{-1}{\bf p}</script> and <script type="math/tex">d{\bf p}/dt = -\nabla \mathcal{V}(\boldsymbol{\theta})</script>, where the Hamiltonian is described by <script type="math/tex">\mathcal{H}(\boldsymbol{\theta}, {\bf p}) = \mathcal{V}(\boldsymbol{\theta}) + \mathcal{K}({\bf p})</script>, with <script type="math/tex">\mathcal{V}(\boldsymbol{\theta}) = -\ln\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT}) - \ln\pi(\boldsymbol{\theta})</script> as the potential energy formed from the negative logarithms of the likelihood and the prior, and <script type="math/tex">\mathcal{K}({\bf p}) = \frac{1}{2}{\bf p}^\textsf{T}{\bf M}^{-1}{\bf p}</script> as the kinetic energy.
Proposed parameters are then accepted with a probability given by <script type="math/tex">\alpha = \textsf{Min}[\textsf{exp}(-\Delta\mathcal{H}), 1]</script>, where <script type="math/tex">\Delta\mathcal{H}</script> is the difference between the energy at the proposed parameter values and the current parameter values.
If energy is exactly conserved along the trajectory, then <script type="math/tex">\Delta\mathcal{H} = 0</script> and every proposal is accepted.
It is usual to use a symplectic integration scheme, such as the leapfrog algorithm (ϵ-discretisation), to solve these ODEs.</p>
<p class="figure wide"><img src="/assets/posts/npe/leapfrog.svg" alt="Leapfrog algorithm" />
<em>The leapfrog algorithm starts by drawing a momentum from a proposal distribution, <script type="math/tex">{\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})</script>. From the initial parameter positions <script type="math/tex">\boldsymbol{\theta}_0</script>, a half step of size <script type="math/tex">\epsilon</script> is taken in the momentum, <script type="math/tex">{\bf p} = {\bf p} - \epsilon\nabla \mathcal{V}(\boldsymbol{\theta}_0)/2</script>, followed by a full step in the parameters, <script type="math/tex">\boldsymbol{\theta}_\textsf{next} = \boldsymbol{\theta}_0+\epsilon{\bf M}^{-1}{\bf p}</script>. This makes up the first half step in the leapfrog. The same procedure of updating <script type="math/tex">{\bf p}</script> and <script type="math/tex">\boldsymbol{\theta}</script> is repeated for N steps, where the intermediate momentum updates are full steps (<script type="math/tex">{\bf p} = {\bf p}-\epsilon\nabla \mathcal{V}(\boldsymbol{\theta})</script>). A last half step in the momentum is then taken. The choice of ϵ dictates the accuracy of the integration. If <script type="math/tex">\epsilon</script> is large then Hamilton’s equations are solved less accurately, which can lead to a change in energy between the initial and proposed parameters and therefore to more rejections. On the other hand, if <script type="math/tex">\epsilon</script> is small then more samples are accepted since the energy is better conserved, but this comes at a higher computational cost.</em></p>
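<p>The integrator and the accept/reject step can be sketched in a few lines of NumPy. This is a plain Hamiltonian Monte Carlo step on a toy target, assuming a fixed mass matrix; it does not include the modifications used in HMCLET, and the function names and the standard normal target are our choices for illustration.</p>

```python
import numpy as np

def leapfrog(theta, p, grad_v, inv_mass, epsilon, n_steps):
    """Integrate Hamilton's equations with the leapfrog scheme."""
    theta, p = theta.copy(), p.copy()
    p -= 0.5 * epsilon * grad_v(theta)        # first half step in momentum
    for step in range(n_steps):
        theta += epsilon * inv_mass @ p       # full step in parameters
        if step < n_steps - 1:
            p -= epsilon * grad_v(theta)      # full step in momentum
    p -= 0.5 * epsilon * grad_v(theta)        # last half step in momentum
    return theta, p

def hmc_step(theta, potential, grad_v, mass, epsilon, n_steps, rng):
    """One HMC transition; returns the new state and whether it was accepted."""
    inv_mass = np.linalg.inv(mass)
    p = rng.multivariate_normal(np.zeros(len(theta)), mass)
    h_old = potential(theta) + 0.5 * p @ inv_mass @ p
    theta_new, p_new = leapfrog(theta, p, grad_v, inv_mass, epsilon, n_steps)
    h_new = potential(theta_new) + 0.5 * p_new @ inv_mass @ p_new
    # Acceptance probability min(exp(-delta H), 1), written to avoid overflow.
    alpha = np.exp(min(0.0, h_old - h_new))
    accepted = rng.random() < alpha
    return (theta_new if accepted else theta), accepted

# Toy target: a two-dimensional standard normal, V = theta.theta / 2.
potential = lambda theta: 0.5 * theta @ theta
grad_v = lambda theta: theta
rng = np.random.default_rng(0)
theta, samples = np.zeros(2), []
for _ in range(1000):
    theta, accepted = hmc_step(theta, potential, grad_v, np.eye(2), 0.2, 10, rng)
    samples.append(theta)
samples = np.array(samples)
```

<p>Reducing <code>epsilon</code> improves energy conservation and raises the acceptance rate, at the price of more gradient evaluations per trajectory of a given length.</p>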
<p>Since neural networks are complex and in general have a large number of highly degenerate parameters, it is very difficult to know the mass matrix <em>a priori</em>.
This means that extremely large steps can be made along the likelihood surface, leading to numerical instability and improper sampling.
To overcome this, we can use the second order geometric information of the likelihood surface by approximating its Hessian using quasi-Newton methods.</p>
<p class="figure wide"><img src="/assets/posts/npe/second_order.svg" alt="Flat likelihood and second order geometric information" />
<em>The Hessian (<script type="math/tex">{\bf B}</script>), i.e. the second order gradient, of the likelihood surface can be approximated using quasi-Newton methods. Quasi-Newton methods are root-finding algorithms in which the Hessian (or Jacobian) is approximated rather than computed exactly. There are many ways to calculate the approximate Hessian; we use the BFGS method in this work. This method is convenient since it can be calculated for free as part of the leapfrog algorithm. When using the second order geometric information the ODEs become <script type="math/tex">d\boldsymbol{\theta}/dt = {\bf B}{\bf M}^{-1}{\bf p}</script> and <script type="math/tex">d{\bf p}/dt = -{\bf B}\nabla \mathcal{V}(\boldsymbol{\theta})</script>. This means that, although the mass matrix is still needed to set the time scales along the parameter directions, the momenta get effectively rescaled by the Hessian, breaking parameter degeneracies and allowing for an efficient acceptance ratio.</em></p>
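<p>For readers unfamiliar with BFGS, the textbook update is sketched below. It builds an approximation to the inverse Hessian from the change in parameters and the change in gradient between two points, both of which are already available inside a leapfrog trajectory. Whether HMCLET stores the Hessian or its inverse, and how it handles the curvature condition, is not specified here, so the inverse form and the safeguard are our assumptions.</p>

```python
import numpy as np

def bfgs_inverse_update(h_inv, s, y):
    """Textbook BFGS update of an approximate inverse Hessian.

    s: change in parameters between two points
    y: change in the gradient of the potential between the same points
    """
    sy = s @ y
    if sy <= 1e-12:
        # Curvature condition violated: keep the previous estimate.
        return h_inv
    rho = 1.0 / sy
    identity = np.eye(len(s))
    left = identity - rho * np.outer(s, y)
    right = identity - rho * np.outer(y, s)
    return left @ h_inv @ right + rho * np.outer(s, s)

# Quadratic potential with a known Hessian, used only to illustrate the update.
hessian = np.array([[2.0, 0.5], [0.5, 1.0]])
s = np.array([1.0, 0.0])
y = hessian @ s
h_inv = bfgs_inverse_update(np.eye(2), s, y)
```

<p>After each update the approximation satisfies the secant condition, meaning that it maps the observed change in gradient back onto the observed change in parameters.</p>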
<h1 id="results">Results</h1>
<p>With a neural bias model formed of a neural physical engine which is sensitive to non-local radial information, a neural density estimator to give us evaluations of sufficiently arbitrary functions, and a sampling scheme which can effectively explore the complex likelihood landscape, we can now infer the halo mass distribution function.</p>
<p class="figure wide"><img src="/assets/posts/npe/NBM_square.svg" alt="Outline of the BORG algorithm" />
<em>The BORG algorithm infers the initial conditions of the dark matter distribution. First the initial conditions are drawn from a prior given a cosmology to generate an initial dark matter density field. In this work, this dark matter density field is then evolved forward using Lagrangian perturbation theory to obtain the dark matter density field today. This is then passed through the neural physical engine to obtain an informative field of summaries about the abundance of dark matter halos on a grid. This can then be compared to the observed halo catalogue via the Poissonian likelihood, using the halo mass distribution function provided by the neural density estimator of the neural bias model. Evaluating this likelihood allows us to obtain posterior samples of all of the initial phases of the dark matter density distribution and all of the parameters of the neural bias model.</em></p>
<p>We use a halo catalogue constructed using Rockstar from a chunk of the VELMASS Ω dark matter simulation, which has a Planck-like cosmology. This catalogue has about 10,000 halos with a mass threshold of 2×10<sup>12</sup> solar masses.</p>
<p>As shown in the figures below, we are able to fit the halo mass distribution function extremely well, with sampling around the observed catalogue. Furthermore, the information used comes from the non-local region around each voxel in the gridded density field, showing that the surrounding area holds information about the abundance of halos.</p>
<p class="figure wide"><img src="/assets/posts/npe/hmdf.svg" alt="Halo mass distribution function" />
<em>The abundance of halos at a certain mass given a density environment from the VELMASS halo catalogue is plotted using the diamonds with dashed lines.
The more dense the environment, the more halos are expected at all masses.
The solid lines are the mean halo mass distribution function values from the neural bias model.
The filled areas are the 1σ intervals either side of the mean obtained by the samples from the Markov chain.
We can see that the fit is very good (even with the very simple model considered here), and that the shape of the function changes with density environment.
This shows that the neural bias model is able to account for the response of the density field.</em></p>
<p class="figure wide"><img src="/assets/posts/npe/3D_projections.svg" alt="3D projections of the field" />
<em>Here we see an example of an initial density field and the same field evolved using Lagrangian perturbation theory on the top row.
The bottom row shows the effect of the neural physical engine, which enhances the contrast and provides a more informative summary of the abundance of halos than the LPT field.
This is because non-local information is gathered from the surrounding voxels by the neural physical engine.
The last box (bottom right) shows the true halos from the VELMASS halo catalogue placed onto the same grid.
Note that the NPE field does not look like the halo distribution since a Poisson sampling of the halo mass distribution function is needed to get a stochastic realisation of the halo distribution.</em></p>
<h1 id="future-work">Future work</h1>
<p>The methods presented in this paper represent the state of the art in terms of machine learning, and provide new methods for dealing with the bias model in BORG and for generating halo catalogues from the neural bias model.
We will continue our work in two main directions.
The first is to look at bypassing the halos completely by learning the form of the likelihood using some form of neural density estimation (or neural flow) which would allow us to be more agnostic about the form of the likelihood.
This would mean that we could, in principle, marginalise out the effect of the ambiguity in the likelihood to provide robust constraints on the initial density phases and cosmology.
The second is to use architecture optimisation schemes to find a better fit to the halo mass distribution function for use in halo catalogue generation.</p>
<h1 id="references">References</h1>
<ul>
<li>Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, Supranta Sarma Boruah, Jens Jasche, Michael J. Hudson, 2019, submitted to MNRAS, <a href="https://arxiv.org/abs/1909.06379">arXiv:1909.06379</a></li>
</ul>
Tue, 15 Oct 2019 00:00:00 +0200
https://www.aquila-consortium.org/method/machine%20learning/npe.html
method, machine learning
A fifth-force resolution of the Hubble tension
<h1 id="background">Background</h1>
<p>At least on large scales, the standard cosmological model suffers from just one <script type="math/tex">>3\sigma</script> inconsistency. This is the Hubble tension: while the present-day expansion rate, <script type="math/tex">H_0</script>, inferred from the Cosmic Microwave Background is <script type="math/tex">67.4 \pm 0.5</script> km s<script type="math/tex">^{-1}</script> Mpc<script type="math/tex">^{-1}</script>, the value measured locally (by combining distance measurements to objects successively further away in a “cosmic distance ladder”) is <script type="math/tex">74.03 \pm 1.42</script> km s<script type="math/tex">^{-1}</script> Mpc<script type="math/tex">^{-1}</script>. This discrepancy is <script type="math/tex">4.4\sigma</script>, and appears to imply some form of new physics that invalidates direct comparison between low and high redshift probes of <script type="math/tex">H_0</script> within <script type="math/tex">\Lambda</script>CDM.</p>
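<p>The quoted significance follows from the two measurements if they are treated as independent and Gaussian, which is the assumption made in this short check. The helper name <code>tension_sigma</code> is ours.</p>

```python
import math

def tension_sigma(value_a, error_a, value_b, error_b):
    """Difference between two independent Gaussian measurements in units of sigma."""
    # Independent errors add in quadrature.
    return abs(value_a - value_b) / math.hypot(error_a, error_b)

tension = tension_sigma(74.03, 1.42, 67.4, 0.5)
print(f"{tension:.1f} sigma")  # prints "4.4 sigma"
```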
<p>A key assumption in the local measurement of <script type="math/tex">H_0</script> is that the objects that calibrate the distance ladder – primarily Cepheid stars and Type Ia supernovae – have identical properties between successive rungs. But in a wide variety of beyond-<script type="math/tex">\Lambda</script>CDM cosmological models which invoke so-called “screened fifth forces”, this is likely not true. Rather, while the Cepheids in the Milky Way and NGC 4258 (whose distance is measured independently by means of a water maser) will be screened by the dense environments of their hosts, those at higher redshift that calibrate supernova absolute magnitudes will be unscreened and hence feel the full fifth force. This induces a bias in the Cepheid period–luminosity relation which causes the conventional analysis to underestimate the distance to extragalactic Cepheid hosts, and hence, at fixed redshift, to overestimate <script type="math/tex">H_0</script> (Figure 1). Thus, in such models, the expansion rate measured locally would be more in accord with that inferred from recombination.</p>
<p class="figure wide"><img src="/assets/posts/h0/Fig_1.png" alt="Figure 1" />
<em>Left panel: The rungs of the cosmic distance ladder and their typical screening status. Right panel: The Cepheid period–luminosity relation when various parts of a Cepheid are unscreened. Assuming unscreened Cepheids lie on the Newtonian relation underestimates their luminosity and hence their distance.</em></p>
<h1 id="unscreening-the-cosmic-distance-ladder">Unscreening the cosmic distance ladder</h1>
<p>We have quantified these effects to flesh out this potential resolution to the Hubble tension<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. We began by formulating a set of observational proxies for the screening behaviour of Cepheids, which encompasses well-studied screening mechanisms such as chameleon, k-mouflage and Vainshtein, a newly-proposed mechanism based on interactions between baryons and dark matter<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>, and others described phenomenologically and not yet associated with an underlying theory. We then utilised the density field reconstruction of <a href="/method/borgpm.html">the BORG-PM model</a>, as encapsulated in the screening maps described in an earlier <a href="/method/observations/fifth_force.html">post</a>, to evaluate these proxies over the Cepheids used in the distance ladder, and hence calculate the change in distance to the Cepheid hosts that the action of a screened fifth force would imply.</p>
<p>The magnitude of the difference depends on the strength of the fifth force. We determined maximum viable values of this in our models by means of consistency tests within the distance ladder data. The most constraining test compares the distances to galaxies measured by both the Cepheid period–luminosity relation and the tip of the red giant branch: these distances are pushed in different directions by a fifth force, so their consistency imposes a limit on the force’s strength. This is shown in Fig. 2, as a function of the fraction of galaxies that are unscreened and separately for the cases in which Cepheid cores (governing luminosity) are unscreened, or only Cepheid envelopes (governing period). This test is the strongest of its kind, and is completely agnostic as to the nature or origin of the modification to gravity.</p>
<p class="figure wide"><img src="/assets/posts/h0/Fig_2.png" alt="Figure 2" />
<em>Constraints on fifth-force strength (relative to gravity) from comparing Cepheid and tip-of-the-red-giant-branch distances, as a function of the fraction of galaxies that are unscreened. Dashed lines indicate typical unscreened fractions in our models.</em></p>
<h1 id="15sigma-consistency-of-local-and-cmb-h_0"><script type="math/tex">1.5\sigma</script> consistency of local and CMB <script type="math/tex">H_0</script></h1>
<p>Setting the screening threshold to ensure that the galaxies that calibrate the period–luminosity relation (NGC 4258 and the Milky Way) are screened, and imposing the bound on fifth-force strength from Fig. 2, we calculated the maximum reduction in the inferred <script type="math/tex">H_0</script> that each model could afford. Our results are shown in Fig. 3. While models that only unscreen Cepheid envelopes (right panel) can reduce the tension with Planck to <script type="math/tex">\gtrsim2\sigma</script>, those that unscreen cores (among them the baryon–dark matter interaction model, a dark energy model that is otherwise only weakly constrained) can achieve <script type="math/tex">1.5\sigma</script> consistency. These results reveal another possible advantage to cosmologies with fifth forces, as well as demonstrating more generally that novel local resolutions of the <script type="math/tex">H_0</script> problem are possible.</p>
<p class="figure wide"><img src="/assets/posts/h0/Fig_3.png" alt="Figure 3" />
<em>Constraints on local <script type="math/tex">H_0</script> for each of our screening models. The most successful models reach 1.5<script type="math/tex">\sigma</script> consistency with the Planck result, a level at which statistical fluctuations may account for the discrepancy.</em></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Harry Desmond, Bhuvnesh Jain, Jeremy Sakstein, 2019, <em>A local resolution of the Hubble tension: The impact of screened fifth forces on the cosmic distance ladder</em>, submitted to Phys. Rev. D., <a href="https://arxiv.org/pdf/1907.03778">arxiv 1907.03778</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>J. Sakstein, H. Desmond, B. Jain, 2019, <em>Screened Fifth Forces Mediated by Dark Matter–Baryon Interactions: Theory and Astrophysical Probes</em>, submitted to Phys. Rev. D., <a href="https://arxiv.org/pdf/1907.03775">arxiv 1907.03775</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 13 Jul 2019 00:00:00 +0200
https://www.aquila-consortium.org/method/observations/h0.html
method, observations
Algorithms for likelihood-free cosmological data analysis
<h1 id="overview">Overview</h1>
<p>The extraction of physical information from wide and deep astronomical surveys relies on statistical techniques to compare models and observations. A common scenario in cosmology is when we can generate synthetic data through forward simulations, but cannot explicitly formulate the likelihood of the model. The generative process can be extremely general (a noisy non-linear dynamical system involving an unrestricted number of latent variables) and is often computationally expensive. Likelihood-free inference (LFI) provides a framework for performing Bayesian inference in this context, by replacing likelihood calculations with data model evaluations. In its simplest form, LFI takes the form of likelihood-free rejection sampling (LFRS), which tends to be (i) extremely expensive, since many simulated data sets get rejected, and (ii) very limited in the number of parameters that can be treated.</p>
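<p>Likelihood-free rejection sampling is simple enough to write down in full, which also makes its cost visible. The sketch below uses a toy black box of our own invention (Gaussian data with unknown mean, compressed to the sample mean), a flat prior and a fixed acceptance threshold; none of these choices come from the articles discussed here.</p>

```python
import numpy as np

def simulator(theta, rng, n_data=50):
    """Hypothetical black box: Gaussian data with mean theta, compressed to its sample mean."""
    return rng.normal(theta, 1.0, size=n_data).mean()

def rejection_sampling(observed, n_draws, threshold, rng):
    """Keep prior draws whose simulated summary lies within `threshold` of the observation."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)        # draw from a flat prior on [-5, 5]
        if abs(simulator(theta, rng) - observed) < threshold:
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(0)
n_draws = 20000
posterior = rejection_sampling(observed=1.0, n_draws=n_draws, threshold=0.1, rng=rng)
acceptance_rate = len(posterior) / n_draws
```

<p>In this one-parameter toy problem only a few percent of the simulations are kept, and the accepted fraction falls rapidly as parameters are added or the threshold is tightened.</p>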
<p>In two recent articles, we presented methodological advances, aiming at fitting cosmological data with “black-box” numerical models. Each of them addresses one of the shortcomings of LFRS. The first approach, BOLFI, is intended for specific cosmological models (with <script type="math/tex">n \lesssim 10</script> parameters) and a general exploration of parameter space. It combines Gaussian process regression of the distance between observed and simulated data with Bayesian optimization. As a result, the number of required simulations is reduced by several orders of magnitude with respect to LFRS. The second approach, SELFI, allows the inference of <script type="math/tex">n \gtrsim 100</script> parameters (as is necessary for a model-independent parametrization of theory) while assuming stronger prior constraints in parameter space. It relies on a Taylor expansion of the simulator to build an effective posterior distribution. The resulting algorithm allows LFI in much higher-dimensional settings than LFRS.</p>
<h1 id="likelihood-free-inference-of-black-box-data-models">Likelihood-free inference of black-box data models</h1>
<p>Simulator-based statistical models are usually given in terms of numerical “black-boxes”. They provide realistic predictions for artificial observations when provided with all necessary input parameters. These consist of target parameters as well as nuisance parameters such as initial phases, noise realization, sample variance, etc. This “latent space” can often be hundred-to-multi-million dimensional. Once all input parameters are fixed, the black-box typically consists of a simulation step and a data compression step. Black-box models can be written in a hierarchical form and conveniently represented graphically (figure 1).</p>
<p class="figure"><img src="/assets/posts/lfi/black-box_bhm.png" alt="Hierarchical representation of a black-box data model" />
<em>Hierarchical representation of a typical black-box data model. The rounded green boxes represent probability distributions and the purple squares represent deterministic functions. For more details, see figure 1 in Leclercq et al. 2019.<sup id="fnref:3"><a href="#fn:3" class="footnote">1</a></sup></em></p>
<p>The goal of LFI is to find suitable approximations that allow an estimation of the probability distribution of target parameters conditional on observed data summaries, using only black-box evaluations.</p>
<h1 id="bolfi-bayesian-optimization-for-likelihood-free-inference">BOLFI: Bayesian Optimization for Likelihood-Free Inference</h1>
<p>BOLFI (Bayesian Optimization for Likelihood-Free Inference<sup id="fnref:1"><a href="#fn:1" class="footnote">2</a></sup><sup id="fnref:2"><a href="#fn:2" class="footnote">3</a></sup>) is a cutting-edge machine learning algorithm for LFI under the constraint of a very limited simulation budget (typically a few thousand), suitable when the problem has a sufficiently small number of target parameters (<script type="math/tex">n \lesssim 10</script>). Conventional approaches such as LFRS generally require too many simulations, due to their lack of knowledge about how the parameters affect the distance between observed and simulated data. As a response, BOLFI combines Gaussian process regression of this distance to build a surrogate surface with Bayesian Optimization to actively acquire training data (figure 2).</p>
<p class="figure wide"><img src="/assets/posts/lfi/bayesian_optimization.png" alt="Bayesian optimization" />
<em>Illustration of four consecutive steps of Bayesian optimization to learn a test function. For each step, the top panel shows the training data points (red dots) and the Gaussian process regression (blue line and shaded region). The bottom panel shows the acquisition function (solid green line). The next acquisition point, i.e. where to run a simulation to be added to the training set, is shown in orange. For more details, see figure 4 in Leclercq 2018.<sup id="fnref:2:1"><a href="#fn:2" class="footnote">3</a></sup></em></p>
<p>The target parameter space is explored efficiently and in all generality. We extended the method to use the optimal acquisition function for the purpose of minimizing the expected uncertainty in the approximate posterior density, in the parametric approach to likelihood approximation. As a result, the number of required simulations is typically reduced by two to three orders of magnitude, and the proposed acquisition function produces more accurate posterior approximations, as compared to LFRS.</p>
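<p>As an illustration of the loop described above, here is a minimal sketch of Bayesian optimization with a Gaussian process surrogate. Everything in it is an assumption made for the example: the one-dimensional <code>toy_distance</code> stands in for the distance between observed and simulated data, the kernel hyperparameters are fixed by hand, and the acquisition rule is a simple lower confidence bound rather than the acquisition function used in the paper.</p>

```python
import numpy as np


def rbf_kernel(a, b, length=0.3, amplitude=1.0):
    """Squared-exponential covariance between two sets of 1D points."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * d2 / length**2)


def gp_predict(x_train, y_train, x_test, noise=1e-4):
    """Gaussian process regression: posterior mean and standard deviation."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    offset = y_train.mean()  # constant mean function
    mean = offset + k_st @ np.linalg.solve(k_tt, y_train - offset)
    v = np.linalg.solve(k_tt, k_st.T)
    var = rbf_kernel(x_test, x_test).diagonal() - np.sum(k_st * v.T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


def toy_distance(theta):
    """Hypothetical stand-in for one (expensive) simulation and its distance."""
    return (theta - 0.3) ** 2


def bayesian_optimisation(n_init=3, n_acquire=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, 201)  # candidate parameter values
    x = rng.uniform(-1.0, 1.0, n_init)  # initial training set
    y = toy_distance(x)
    for _ in range(n_acquire):
        mean, std = gp_predict(x, y, grid)
        # Lower confidence bound: favour a low predicted distance
        # or a large surrogate uncertainty.
        theta_next = grid[np.argmin(mean - kappa * std)]
        x = np.append(x, theta_next)
        y = np.append(y, toy_distance(theta_next))
    mean, _ = gp_predict(x, y, grid)
    return grid[np.argmin(mean)], x


theta_best, visited = bayesian_optimisation()
```

<p>Each iteration costs a single simulation, chosen where the surrogate is either promising or poorly constrained, which is why so few simulations are needed compared to blind sampling of the parameter space.</p>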
<h1 id="selfi-simulator-expansion-for-likelihood-free-inference">SELFI: Simulator Expansion for Likelihood-Free Inference</h1>
<p>Another limitation of conventional approaches to LFI is their inability to scale with the number of target parameters. In order to address problems of high-dimensional inference from black-box data models, we introduced SELFI (Simulator Expansion for Likelihood-Free Inference<sup id="fnref:3:1"><a href="#fn:3" class="footnote">1</a></sup>). Our approach builds upon a novel effective likelihood and upon the linearization of the simulator around an expansion point in parameter space. The workload with SELFI consists of evaluating the covariance matrix and the gradient of data summaries at the expansion point (figure 3). Unlike the workload of likelihood-based Markov Chain Monte Carlo (MCMC) techniques and of BOLFI, this workload is fixed <em>a priori</em> and perfectly parallel.</p>
<p class="figure wide"><img src="/assets/posts/lfi/covariance_gradient.png" alt="Covariance and gradient of the black-box" />
<em>Covariance matrix (left) and gradient (right) of data summaries at the expansion point, evaluated through black-box realizations only. These are the only two ingredients necessary to apply SELFI. For more details, see figures 6 and 7 in Leclercq et al. 2019.<sup id="fnref:3:2"><a href="#fn:3" class="footnote">1</a></sup></em></p>
<p>The effective posterior of the target parameters is then obtained through simple “filter equations,” the form of which is analogous to a Wiener filter. SELFI allows the solution of inference tasks from black-box data models, in much higher dimension than conventional approaches to LFI.</p>
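<p>To make the analogy with a Wiener filter concrete, the sketch below implements the textbook update for a linearized model with a Gaussian prior and Gaussian noise. It is an illustration under stated assumptions, not a transcription of the SELFI equations: the prior mean is taken to coincide with the expansion point, and the names <code>theta_0</code>, <code>f_0</code>, <code>grad_f</code> and <code>data_cov</code> are chosen for this example. The exact expressions are given in Leclercq et al. 2019.</p>

```python
import numpy as np


def linear_gaussian_filter(theta_0, prior_cov, f_0, grad_f, data_cov, observed):
    """Posterior mean and covariance for a simulator linearized around theta_0.

    theta_0   : expansion point, also used as prior mean (assumption), shape (n,)
    prior_cov : prior covariance of the parameters, shape (n, n)
    f_0       : mean data summaries at the expansion point, shape (m,)
    grad_f    : gradient of the summaries w.r.t. parameters, shape (m, n)
    data_cov  : covariance of the summaries at the expansion point, shape (m, m)
    observed  : observed data summaries, shape (m,)
    """
    data_precision = np.linalg.inv(data_cov)
    prior_precision = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(grad_f.T @ data_precision @ grad_f + prior_precision)
    post_mean = theta_0 + post_cov @ grad_f.T @ data_precision @ (observed - f_0)
    return post_mean, post_cov


# One parameter, one summary, unit prior and noise variances.
mean, cov = linear_gaussian_filter(
    theta_0=np.array([0.0]),
    prior_cov=np.array([[1.0]]),
    f_0=np.array([0.0]),
    grad_f=np.array([[1.0]]),
    data_cov=np.array([[1.0]]),
    observed=np.array([2.0]),
)
```

<p>In this toy case the observation and the prior carry equal weight, so the posterior mean lies halfway between them and the variance is halved. The only simulator-dependent inputs are the mean, covariance and gradient of the summaries, all of which can be estimated from independent runs.</p>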
<h1 id="cosmological-applications-key-results">Cosmological applications: key results</h1>
<p>In respective papers, we presented the first applications of BOLFI and SELFI to cosmological data analysis.</p>
<h2 id="supernova-cosmology-with-bolfi">Supernova cosmology with BOLFI</h2>
<p>We applied BOLFI to the inference of cosmological parameters from the Joint Lightcurve Analysis (JLA) supernovae data. The model contains two cosmological parameters (the matter density of the Universe <script type="math/tex">\Omega_m</script> and the equation of state of dark energy <script type="math/tex">w</script>) and four nuisance parameters, which are marginalized over. The posterior contours obtained with MCMC, LFRS, and BOLFI are represented in figure 4.</p>
<p class="figure wide"><img src="/assets/posts/lfi/bolfi_jla.png" alt="Supernova cosmology with BOLFI" />
<em>Prior and posterior distributions for the joint inference of the matter density of the Universe, <script type="math/tex">\Omega_m</script>, and the dark energy equation of state, <script type="math/tex">w</script>, from the JLA supernovae data set. BOLFI (red posterior) reduces the number of necessary simulations by two orders of magnitude with respect to LFRS (green posterior) and three orders of magnitude with respect to MCMC (orange posterior). For more details, see figure 7 in Leclercq 2018.<sup id="fnref:2:2"><a href="#fn:2" class="footnote">3</a></sup></em></p>
<p>As can be observed, BOLFI is able to precisely recover the true posterior with as few as 6,000 simulations, which constitutes a reduction by two orders of magnitude with respect to LFRS and three orders of magnitude with respect to MCMC. This reduction in the number of required simulations accelerates the inference massively.</p>
<h2 id="primordial-power-spectrum-and-cosmological-parameters-inference-with-selfi">Primordial power spectrum and cosmological parameters inference with SELFI</h2>
<p>We applied SELFI to a realistic synthetic galaxy survey, with a data model accounting for physical structure formation and incomplete and noisy observations. This data model is provided by the publicly-available <strong>Simbelmynë</strong> code, a hierarchical probabilistic simulator of galaxy survey data.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> Through this application, we showed that the use of non-linear numerical models allows the galaxy power spectrum to be fitted up to at least <script type="math/tex">k_\mathrm{max} = 0.5~h/\mathrm{Mpc}</script>, which represents an increase by a factor of <script type="math/tex">\sim~5</script> in the number of modes used, with respect to traditional techniques. The result is an unbiased inference of the primordial power spectrum (living in <script type="math/tex">n =100</script> dimensions) across the entire range of scales considered, including a high-fidelity reconstruction of baryon acoustic oscillations (figure 5).</p>
<p class="figure wide"><img src="/assets/posts/lfi/selfi_power_spectrum.png" alt="Primordial power spectrum reconstruction with SELFI" />
<em>Primordial power spectrum inference with SELFI from a realistic synthetic galaxy survey. In spite of survey complications which limit the information captured, the inference is unbiased and the signature of baryon acoustic oscillations is well reconstructed up to <script type="math/tex">k \approx 0.3~h/\mathrm{Mpc}</script>, with 5 inferred acoustic peaks, a result which could be improved using more volume (this analysis uses <script type="math/tex">(1~\mathrm{Gpc}/h)^3</script>). For more details, see figure 10 in Leclercq et al. 2019.<sup id="fnref:3:3"><a href="#fn:3" class="footnote">1</a></sup></em></p>
<p>The primordial power spectrum can be seen as a largely agnostic and model-independent parametrization of theory, relying only on weak assumptions (isotropy and Gaussianity). Using the linearized black-box, it can be easily translated <em>a posteriori</em> to constraints on specific cosmological models without (or with minimal) loss of information. For instance, constraints on the parameters of the standard cosmological model, for two different synthetic data realizations (with different input cosmologies, phase and noise realizations), are shown in figure 6.</p>
<p class="figure wide"><img src="/assets/posts/lfi/selfi_cosmology.png" alt="Inference of cosmological parameters with SELFI" />
<em>Cosmological parameter inference using a linearized black-box model of galaxy surveys. The prior is shown in blue, and the effective posteriors for two different data realizations are shown in red and purple.</em></p>
<p>We therefore obtain an unbiased and robust measurement of cosmological parameters.</p>
<div class="footnotes">
<ol>
<li id="fn:3">
<p>F. Leclercq, W. Enzi, J. Jasche & A. Heavens 2019, <em>Primordial power spectrum and cosmology from black-box galaxy surveys</em>, MNRAS <strong>490</strong>, 4237 (2019), <a href="https://arxiv.org/pdf/1902.10149">arxiv:1902.10149</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:3" class="reversefootnote">↩</a> <a href="#fnref:3:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote">↩<sup>3</sup></a> <a href="#fnref:3:3" class="reversefootnote">↩<sup>4</sup></a></p>
</li>
<li id="fn:1">
<p>M. U. Gutmann & J. Corander 2016, <em>Bayesian Optimization for Likelihood-Free Inference of Simulator-Based Statistical Models</em>, Journal of Machine Learning Research <strong>17</strong>, 1 (2016), <a href="https://arxiv.org/pdf/1501.03291">arxiv:1501.03291</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>F. Leclercq 2018, <em>Bayesian optimisation for likelihood-free cosmological inference</em>, Physical Review D <strong>98</strong>, 063511 (2018), <a href="https://arxiv.org/pdf/1805.07152">arxiv:1805.07152</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:2" class="reversefootnote">↩</a> <a href="#fnref:2:1" class="reversefootnote">↩<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote">↩<sup>3</sup></a></p>
</li>
<li id="fn:4">
<p>The Simbelmynë code: <a href="http://simbelmyne.florent-leclercq.eu">homepage</a> <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 25 Apr 2019 00:00:00 +0200
https://www.aquila-consortium.org/method/lfi.html
https://www.aquila-consortium.org/method/lfi.htmlmethodPainting halos from 3D dark matter fields<h1 id="overview">Overview</h1>
<p>Investigating the formation and evolution of dark matter halos, as the key building blocks of cosmic
large-scale structure, is essential for constraining various cosmological models and further
understanding our Universe. The highly non-linear dynamics involved nevertheless renders this a
complex problem, with computationally costly simulations of gravitational structure formation
currently the only tool to compute the non-linear evolution from initial conditions, yielding mock
dark matter halo catalogues as the main output. However, running very large simulations of pure dark
matter to generate fake observations of the full Universe several times is not feasible, requiring a
large amount of memory and disk storage. A way to emulate such simulations, quickly and reliably,
would be of use to a wide community as a new method for data analysis and light cone production for
the next cosmological survey missions such as Euclid and the Large Synoptic Survey Telescope. In this
context, we employ a deep learning approach to construct an emulator to learn the mapping from dark
matter density to halo fields.</p>
<h1 id="halo-painting-network">Halo painting network</h1>
<p>Our physical mapping network is inspired by a recently proposed variant of generative models, known
as generative adversarial networks (GANs). In particular, we will use the key ideas in training WGANs,
i.e. GANs optimized using the Wasserstein distance, to ensure that our network is able to paint halos
well. A schematic of this Wasserstein mapping framework is provided in Fig. 1. Our generator is the
halo painting network whose role is to learn the underlying non-linear relationship between the input
3D density field and the corresponding halo count distribution. Our critic provides as output the
approximately learned Wasserstein distance between the real and predicted halo distributions.
Intuitively, this Wasserstein distance can be interpreted as the amount of work required to transform
a given probability distribution into the desired target distribution. This distance therefore
corresponds to the loss function that must be minimized to train the halo painting network.</p>
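<p>The "amount of work" interpretation can be made tangible in one dimension, where the Wasserstein-1 distance between two equally sized samples reduces to the mean absolute difference between their sorted values. The sketch below illustrates only this special case; in the halo painting network the distance is instead approximated by the critic, and the function name is chosen for this example.</p>

```python
import numpy as np


def wasserstein_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two 1D samples of equal size."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equally sized samples")
    # Optimal transport in 1D matches the i-th smallest point of one
    # sample to the i-th smallest point of the other.
    return np.mean(np.abs(a - b))


shifted = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0])
identical = wasserstein_1d([2.0, 0.0, 1.0], [0.0, 1.0, 2.0])
```

<p>Shifting a distribution by three units costs three units of work per unit of mass, while two samples with the same values are at zero distance regardless of ordering.</p>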
<p class="figure wide"><img src="/assets/posts/halo_painting/WGN_schematic.jpg" alt="Schematic representation of Wasserstein halo painting network" />
<em>Schematic representation of Wasserstein halo painting network implemented in this work.
The role of the generator is to learn the underlying non-linear relationship between the
input 3D density field and the corresponding halo count distribution. The difference
between the output of the critic for the real and predicted halo distributions is the
approximately learnt Wasserstein distance and is used as the loss function which must be
minimized to train the generator.</em></p>
<h1 id="remarkable-performance-of-halo-painting-emulator">Remarkable performance of halo painting emulator</h1>
<p>We showcased the performance of our halo painting model using quantitative diagnostics. As a preliminary
qualitative assessment, we performed a visual comparison. Fig. 2 depicts the reference and predicted
halo distributions. Qualitative agreement is impressive, implying that the halo painting network is
capable of mapping the complex structures of the cosmic web, such as halos, filaments and voids, to
the corresponding distribution of halo counts.</p>
<p class="figure wide"><img src="/assets/posts/halo_painting/visual_comparison_N500.jpg" alt="Visual comparison" />
<em>Prediction of 3D halo field by our halo painting model for a slice of depth <script type="math/tex">\sim 100h^{-1}</script> Mpc
and side length of <script type="math/tex">\sim2000h^{-1}</script> Mpc. A blind validation dataset is shown in the top right
panel, with the predicted halo count depicted below it. The corresponding second order Lagrangian Perturbation Theory (2LPT) density field is
displayed in the top left panel, with the difference between the reference and predicted halo
distributions depicted in the lower left panel. A visual comparison of the reference and predicted
halo count distributions indicates qualitatively the efficacy of our halo painting network.</em></p>
<h2 id="power-spectrum">Power spectrum</h2>
<p>As a quantitative assessment, the standard practice in cosmology is to use summary statistics.
These summary statistics provide a reliable metric to evaluate our halo painting network in
terms of its capacity to encode essential information. Assuming the cosmological density field
is approximately a Gaussian random field, as is the case on the large scales or at earlier times,
the power spectrum provides a sufficient description of the field. We therefore demonstrated
the capability of our network in reproducing the power spectrum of the reference halos. The left
panel of Fig. 3 illustrates the extremely close agreement of the 3D power spectra of the reference
and predicted halo fields.</p>
<p>We investigated the influence of the fiducial cosmology adopted for the simulations on the efficacy
of our halo mapping model. In the right panel of Fig. 3, we show the network predictions for two
cosmology variants in terms of their respective transfer functions, defined as the
square root of the ratio of the predicted to reference power spectra. The corresponding transfer
functions show a deviation of about <script type="math/tex">10\%</script> from the reference power spectra of their respective
real halo distributions on the smallest and largest scales. This shows that our halo painting model
is slightly sensitive to the underlying cosmology at the level of the power spectrum.</p>
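<p>The transfer function diagnostic can be sketched as follows. The power spectrum estimator is a deliberately simple, assumed implementation (spherical averaging of the Fourier amplitudes of a cubic periodic grid, with no mass-assignment or shot-noise corrections), and the function names and the box size are illustrative rather than those of our analysis pipeline.</p>

```python
import numpy as np


def power_spectrum(field, box_size, n_bins=8):
    """Spherically averaged power spectrum of a field on a cubic periodic grid."""
    n = field.shape[0]
    delta_k = np.fft.rfftn(field)
    power = np.abs(delta_k) ** 2 * box_size**3 / n**6
    k_fund = 2.0 * np.pi / box_size  # fundamental mode of the box
    kx = np.fft.fftfreq(n, d=1.0 / n) * k_fund
    kz = np.fft.rfftfreq(n, d=1.0 / n) * k_fund
    k_mag = np.sqrt(kx[:, None, None] ** 2 + kx[None, :, None] ** 2
                    + kz[None, None, :] ** 2)
    # Linear bins between the fundamental and the Nyquist frequency.
    edges = np.linspace(k_fund, k_fund * n / 2, n_bins + 1)
    idx = np.digitize(k_mag.ravel(), edges) - 1
    valid = (idx >= 0) & (idx < n_bins)
    sums = np.bincount(idx[valid], weights=power.ravel()[valid], minlength=n_bins)
    counts = np.bincount(idx[valid], minlength=n_bins)
    # The half-complex layout of rfftn weights modes slightly unevenly,
    # which is acceptable for a sketch.
    return 0.5 * (edges[1:] + edges[:-1]), sums / np.maximum(counts, 1)


def transfer_function(p_predicted, p_reference):
    """Square root of the ratio of predicted to reference power spectra."""
    return np.sqrt(p_predicted / p_reference)


rng = np.random.default_rng(1)
reference = rng.standard_normal((16, 16, 16))  # toy "reference" field
predicted = 0.9 * reference                    # toy "prediction"
k, p_ref = power_spectrum(reference, box_size=100.0)
_, p_pred = power_spectrum(predicted, box_size=100.0)
transfer = transfer_function(p_pred, p_ref)
```

<p>In this toy case the prediction underestimates the amplitude by 10% on all scales, so the transfer function equals 0.9 in every bin. A perfect emulator would give a transfer function of one.</p>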
<p class="figure wide"><img src="/assets/posts/halo_painting/Pk_cosmo_variation.jpg" alt="3D power spectra of reference and predicted halo fields" />
<em>Left panel: Summary statistics of the 3D power spectra of the reference and predicted halo fields
for one thousand randomly selected patches. The solid lines indicate their respective means, while
the shaded regions indicate their respective <script type="math/tex">1\sigma</script> confidence regions, i.e. 68% probability
volume. The above diagnostics demonstrate the ability of our halo painting model to reproduce the
characteristic statistics of the reference halo fields and therefore provide substantial
quantitative evidence for the performance of our neural network in mapping 3D density fields to
their corresponding halo distributions. Right panel: The corresponding transfer functions highlight
the consistency between the power spectra reconstructed from the predicted and real halo fields for
the three cosmology variants, with the deviation from their respective reference spectra being below
<script type="math/tex">10\%</script>.</em></p>
<h2 id="bispectrum">Bispectrum</h2>
<p>The non-linear dynamics involved in gravitational evolution of cosmic structures contributes to a
certain degree of non-Gaussianity of the cosmic density field on the small scales. Higher-order
statistics are therefore required to characterize this non-Gaussian field. We used the bispectrum
to quantify the spatial distribution of the density and halo fields. The bispectra reconstructed
from the second order Lagrangian Perturbation Theory (2LPT), reference and predicted halo fields are displayed in Fig. 4. In particular, we show
the bispectra for given small- and large-scale configurations. The 2LPT halo field corresponds
to a statistical description of the halo distribution, derived from the 2LPT density field, which
is valid, by construction, at the level of two-point statistics and on large scales. This allows
us to make a fair comparison between the clustering of the respective halo fields. The left panels
of Fig. 4 demonstrate that our halo painting network reproduces the non-linear halo field both on
the small and large scales, and is therefore capable of mapping the complex cosmic structures
apparent in the reference halo field. Our network predictions also show a significant improvement
over the corresponding 2LPT halo fields. In the right panels of Fig. 4, we find that there is a
more significant dependence of our network on the fiducial cosmology at the level of higher-order statistics.</p>
<p class="figure wide"><img src="/assets/posts/halo_painting/bispectrum_cosmo_variation.jpg" alt="3D bispectra of reference and predicted halo fields" />
<em>Left panels: Summary statistics of the 3D bispectra of the 2LPT, reference and predicted halo
fields for given small- and large-scale configurations, as indicated by their respective titles.
In both cases, there is a close agreement between the bispectra from the reference and predicted
halo distributions. Our network predictions are a significant improvement over the corresponding
2LPT halo fields. Right panels: Deviation from the 3D bispectra of the reference halo distributions
of the corresponding predictions for the two cosmology variants. The above bispectrum diagnostics
show that our network is more sensitive to the fiducial cosmology at the level of the bispectrum than at the level of the power spectrum.
The <script type="math/tex">1\sigma</script> confidence regions for five hundred randomly selected patches are depicted in each panel.</em></p>
<h1 id="key-advantages">Key advantages</h1>
<ul>
<li>Extremely efficient once trained. Our emulator is capable of rapidly predicting simulations of halo
distribution based on a computationally cheap cosmic density field. For instance, the network
prediction for a <script type="math/tex">256^3</script> simulation size requires roughly one second on the NVIDIA Quadro P6000.</li>
<li>Can predict the 3D halo distribution for any arbitrary simulation box size. A large simulation box,
therefore, does not require tiling of smaller sub-elements. More importantly, this implies that our
neural network can be trained on smaller simulations and subsequently used to predict large halo
distributions.</li>
<li>Encodes mass information of halos, such that our method can predict the mass distribution of halos.</li>
<li>Allows us to bypass ad hoc galaxy bias models and work in terms of better understood models.</li>
</ul>
<h1 id="potential-applications">Potential applications</h1>
<ul>
<li>Fast generation of mock halo catalogues and light cone production. This would be useful for the data
analysis of upcoming large galaxy surveys of unprecedented sizes.</li>
<li>To fill in small-scale structure at a high resolution from low resolution large-scale simulations.</li>
<li>As a component in Bayesian forward modelling techniques for large-scale structure inference (cf. BORG)
or cosmological parameter inference (cf. ALTAIR) to accelerate the scientific process, rendering
detailed and high-resolution analyses feasible. This would provide statistically interpretable results,
while maintaining the scientific rigour.</li>
</ul>
<h1 id="references">References</h1>
<ul>
<li>D. Kodi Ramanah, T. Charnock & G. Lavaux, 2019, submitted to PRD, <a href="https://arxiv.org/pdf/1903.10524">arxiv 1903.10524</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /></li>
<li>A notebook tutorial to paint the halos of the article: <a href="https://nbviewer.jupyter.org/github/doogesh/halo_painting/blob/master/wasserstein_halo_mapping_network.ipynb">notebook</a></li>
<li>Source code repository: <a href="https://github.com/doogesh/halo_painting">https://github.com/doogesh/halo_painting</a></li>
</ul>
Sun, 31 Mar 2019 00:00:00 +0100
https://www.aquila-consortium.org/method/halo-painting.html
https://www.aquila-consortium.org/method/halo-painting.htmlmethodBayesian treatment of unknown foregrounds<h1 id="overview">Overview</h1>
<p>To probe the Universe on the cosmological scales, we employ large galaxy redshift
catalogues which encode the spatial distribution of galaxies. However, these galaxy
surveys are affected by various contaminants, such as dust,
stars and the atmosphere, commonly referred to as foregrounds. Conventional methods
for the treatment of such contaminations rely on a sufficiently precise estimate of
the map of expected foreground contaminants to account for them in the statistical
analysis. Such approaches exploit the fact that the sources and mechanisms involved
in the generation of these contaminants are well-known.</p>
<p>But how can we ensure robust cosmological inference from galaxy surveys if we are
facing as yet unknown foreground contaminations? In particular, the next generation
of surveys (e.g. <a href="https://www.euclid-ec.org/">Euclid</a>, <a href="https://www.lsst.org/">LSST</a>)
will not be limited by noise but by such systematic effects. We propose a novel
likelihood<sup id="fnref:K"><a href="#fn:K" class="footnote">1</a></sup> which accurately accounts for and corrects effects of unknown foreground
contaminations. Robust likelihood approaches, as presented below, have a potentially
crucial role in optimizing the scientific returns of state-of-the-art surveys.</p>
<h1 id="robust-likelihood">Robust likelihood</h1>
<p>The underlying conceptual framework of our novel likelihood relies on the
marginalization of the unknown large-scale foreground contamination amplitudes. To
this end, we need to label voxels having the same foreground modulation and this is
encoded via a colour indexing scheme that groups the voxels into a collection of
angular patches. This requires the construction of a sky map which is divided into
regions of a given angular scale, with each region denoted by a specific colour, as
illustrated in Fig. 1 (a). The corresponding representation on a 3D grid results in
a 3D distribution of patches, with a given slice of the coloured grid depicted
in Fig. 1 (b). The collection of voxels belonging to a particular patch is employed
in the computation of the robust likelihood.</p>
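<p>The colour indexing can be illustrated with the following sketch, which assigns to each voxel of a cubic grid the label of the angular patch it falls in, as seen from the observer. The patches here are a simple equal-area grid in (cos θ, φ); this pixelization, the function name and its parameters are assumptions made for the example and are not necessarily those of the actual implementation.</p>

```python
import numpy as np


def colour_voxels(n_grid, box_size, observer, n_mu=4, n_phi=8):
    """Label each voxel by the angular patch containing its line of sight."""
    centres = (np.arange(n_grid) + 0.5) * box_size / n_grid
    x, y, z = np.meshgrid(centres, centres, centres, indexing="ij")
    dx, dy, dz = x - observer[0], y - observer[1], z - observer[2]
    r = np.sqrt(dx**2 + dy**2 + dz**2)
    # cos(theta) and phi of each voxel as seen from the observer.
    mu = np.divide(dz, r, out=np.zeros_like(r), where=r > 0)
    phi = np.mod(np.arctan2(dy, dx), 2.0 * np.pi)
    # Uniform bins in cos(theta) and phi give patches of equal solid angle.
    i_mu = np.minimum(((mu + 1.0) / 2.0 * n_mu).astype(int), n_mu - 1)
    i_phi = np.minimum((phi / (2.0 * np.pi) * n_phi).astype(int), n_phi - 1)
    return i_mu * n_phi + i_phi


# Observer at the centre of a 16^3 box; 4 x 8 = 32 colours on the sky.
labels = colour_voxels(16, 16.0, observer=(8.0, 8.0, 8.0))
```

<p>All voxels along the same line of sight share a colour, which is the extrusion of the sky map onto the 3D grid. The voxels of a given colour are then the ones that share a single unknown foreground amplitude.</p>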
<p>Our proposed data model is conceptually straightforward and provides a maximally
ignorant approach to deal with unknown systematics, with the colouring scheme being
independent of any prior foreground information. As such, the numerical implementation
of our novel likelihood is generic and does not require any adjustments to the other
components in the forward modelling framework of BORG (Bayesian Origin Reconstruction
from Galaxies) for the inference of non-linear cosmic structures.</p>
<p class="figure wide"><img src="/assets/posts/robust/colours.jpg" alt="Colour indexing scheme on the sphere" />
<em>(a) Schematic to illustrate the colour indexing of the survey elements. Colours are
assigned to patches of a given angular scale. (b) Slice through the 3D coloured box
resulting from the extrusion of the colour indexing scheme on the left panel onto a
3D grid. This collection of coloured patches is subsequently employed in the
computation of the robust likelihood.</em></p>
<h1 id="comparison-with-a-standard-poissonian-likelihood-analysis">Comparison with a standard Poissonian likelihood analysis</h1>
<p>We showcase the application of our robust likelihood to a mock data set with
significant foreground contaminations and evaluate its performance via a comparison
with an analysis employing a standard Poissonian likelihood, as typically used in
modern large-scale structure analyses. The results illustrated below clearly
demonstrate the efficacy of our proposed likelihood in robustly dealing with unknown
foreground contaminations for the inference of non-linearly evolved dark matter
density fields and the underlying cosmological power spectra from deep galaxy
redshift surveys.</p>
<h2 id="inferred-dark-matter-density-fields">Inferred dark matter density fields</h2>
<p>We first study the impact of the large-scale contamination on the inferred non-linearly
evolved density field. We compare the ensemble mean density fields and
corresponding standard deviations for the two Markov chains obtained using BORG with
the Poissonian and novel likelihoods, respectively, illustrated in the top and bottom
panels of Fig. 2, for a particular slice of the 3D density field. As can be deduced from
the top left panel of Fig. 2, the standard Poissonian analysis results in spurious
effects in the density field, particularly close to the boundaries of the survey since
these are the regions that are the most affected by the dust contamination. In contrast,
our novel likelihood analysis yields a homogeneous density distribution through the
entire observed domain, with the filamentary nature of the present-day density field
clearly seen. From this visual comparison, it is evident that our novel likelihood is
more robust against unknown large-scale contaminations.</p>
<p class="figure wide"><img src="/assets/posts/robust/panels_density.png" alt="Inferred density fields" />
<em>Mean and estimated uncertainty of the non-linearly evolved density fields, computed
from the sampled realizations of the respective Markov chains obtained from both the
Poissonian (upper panels) and novel likelihood (lower panels) analyses, with the same
slice through the 3D fields being depicted. Unlike our robust data model, the standard
Poissonian analysis yields some artefacts in the reconstructed density field,
particularly near the survey boundary, where the foreground contamination is stronger.</em></p>
<h2 id="reconstructed-matter-power-spectra">Reconstructed matter power spectra</h2>
<p>From the realizations of our inferred 3D initial density field, we can reconstruct the
corresponding matter power spectra and compare them to the prior cosmological power
spectrum adopted for the mock generation. The top panels of Fig. 3 illustrate the
inferred power spectra for both likelihood analyses, with the bottom panels displaying
the ratio of the a posteriori power spectra to the prior power spectrum. While the
standard Poissonian analysis yields excessive power on the large scales due to the
artefacts in the inferred density field, the analysis with our novel likelihood allows
us to recover an unbiased power spectrum across the full range of Fourier modes.</p>
<p class="figure wide"><img src="/assets/posts/robust/Pk.jpg" alt="Reconstructed power spectra from likelihood analysis" />
<em>Reconstructed power spectra from the inferred initial conditions from the BORG analysis
for the robust likelihood (left panel) and the Poissonian likelihood (right panel).
The power spectra of the individual realizations, after the initial burn-in phase, from
the robust likelihood analysis possess the correct power across all scales considered,
demonstrating that the foregrounds have been properly accounted for. In contrast, the
standard Poissonian analysis exhibits spurious power artefacts due to the unknown
foreground contaminations, yielding excessive power on these scales.</em></p>
<div class="footnotes">
<ol>
<li id="fn:K">
<p>N. Porqueres, D. Kodi Ramanah, J. Jasche, G. Lavaux, 2018, submitted to A&A, <a href="https://arxiv.org/pdf/1812.05113">arxiv 1812.05113</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:K" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 14 Dec 2018 00:00:00 +0100
https://www.aquila-consortium.org/method/robust.html
https://www.aquila-consortium.org/method/robust.htmlmethodPrecision cosmology with expansion<h1 id="overview">Overview</h1>
<p>The exploration of the Universe at large relies mostly on the use of large
galaxy surveys, i.e. compilations of the positions and optical properties of
galaxies in the sky. These surveys are either photometric, when only wide band
observations are available, or spectroscopic, when the emission of each
galaxy has been finely measured at different wavelengths. From the luminous
properties we derive the ‘redshift’ of each galaxy, i.e. its total apparent
receding velocity.</p>
<p>Sophisticated and optimal data analysis techniques for cosmological inference
from galaxy redshift surveys are in increasing demand to cope with the present
and upcoming avalanches of cosmological data (e.g.
<a href="https://www.euclid-ec.org/">Euclid</a>, <a href="https://www.darkenergysurvey.org/">DES</a>,
<a href="https://www.desi.lbl.gov/">DESI</a>), and therefore optimize the scientific
returns of the missions. This is all the more critical as each survey brings
us closer to a full census of the galaxy distribution in our patch of Universe.
We are thus running out of exploitable information on our Universe. In our
latest article<sup id="fnref:K"><a href="#fn:K" class="footnote">1</a></sup> (slides are also available<sup id="fnref:T"><a href="#fn:T" class="footnote">2</a></sup>), we present, for the first time, a non-linear Bayesian
inference framework to constrain cosmological parameters using a kind of
anisotropy visible in galaxy redshift surveys, via an application of the
Alcock-Paczyński (AP) test. This novel approach extracts several orders of
magnitude more information from the cosmological expansion compared to classical
approaches, to infer cosmological parameters and jointly reconstruct the
underlying 3D dark matter density field.</p>
<h1 id="alcock-paczyński-test">Alcock-Paczyński test</h1>
<div class="figure movie">
<div class="holder_video">
<video class="video-js" controls="" loop="" preload="metadata" data-setup='{"fluid":true}'><source src="/assets/posts/altair/cosmo_larger_ellipse_N256_small.mp4" type="video/mp4" /> <p class="vjs-no-js">To view this video please enable javascript, and consider upgrading to a web browser that <a href="https://videojs.com/html5-video-support/" target="_blank">supports HTML5 video</a>.</p> </video>
</div>
<em>A closed trajectory in the (<script type="math/tex">\Omega_{\mathrm{m}}</script>, <script type="math/tex">w_0</script>) plane, depicting the cosmological dependence of the cosmic expansion history for a fixed set of density initial conditions and power spectrum.</em>
</div>
<p>The Alcock-Paczyński (AP) test is a cosmological test of the expansion of the
Universe and its geometry. The main advantage is that it is independent of the
evolution of galaxies but depends only on the geometry of the Universe. The
assumption of incorrect cosmological parameters in data analysis yields
distortions in the appearance of any spherical object or isotropic statistical
distribution. The AP test provides a pathway to exploit this resulting spurious
anisotropy to constrain the cosmological parameters. In this work, we invoke the
AP test to ensure that the underlying geometrical properties of isotropy of the
Universe are maintained. As such, the key underlying assumption relies purely on
the geometrical properties of the cosmological principle.</p>
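<p>The distortion at the heart of the AP test can be quantified with a short calculation. The sketch below assumes a flat Universe with a constant dark energy equation of state, expresses distances in units of <script type="math/tex">c/H_0</script>, and uses function names chosen for this example. It compares the size an object appears to have along and across the line of sight when its redshift and angular extents are converted to comoving distances with an assumed cosmology that differs from the true one.</p>

```python
import numpy as np


def expansion_rate(z, omega_m, w0):
    """Dimensionless Hubble rate E(z) = H(z)/H0 for flat wCDM."""
    omega_de = 1.0 - omega_m
    return np.sqrt(omega_m * (1.0 + z) ** 3
                   + omega_de * (1.0 + z) ** (3.0 * (1.0 + w0)))


def comoving_distance(z, omega_m, w0, n_steps=4000):
    """Comoving distance in units of c/H0 (trapezoidal integration)."""
    grid = np.linspace(0.0, z, n_steps + 1)
    integrand = 1.0 / expansion_rate(grid, omega_m, w0)
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))


def ap_distortion(z, assumed, true):
    """Apparent stretch of a small object along and across the line of sight.

    `assumed` and `true` are (omega_m, w0) tuples. A sphere appears
    spherical only if the two returned factors are equal.
    """
    along = expansion_rate(z, *true) / expansion_rate(z, *assumed)
    across = comoving_distance(z, *assumed) / comoving_distance(z, *true)
    return along, across


correct = ap_distortion(0.5, assumed=(0.3, -1.0), true=(0.3, -1.0))
wrong = ap_distortion(0.5, assumed=(0.5, -1.0), true=(0.3, -1.0))
```

<p>With the correct cosmology both factors are equal to one. With the wrong matter density the two factors differ, so that a spherical object appears squashed along the line of sight: this is the spurious anisotropy that the AP test exploits.</p>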
<h1 id="inference-machinery">Inference machinery</h1>
<p>To encode the AP test, we developed an extension to the hierarchical Bayesian
inference machinery of BORG (Bayesian Origin Reconstruction from Galaxies),
originally developed for the non-linear reconstruction of large-scale
structures. Our physical model of the non-linearly evolved density field, as
probed by galaxy surveys, employs Lagrangian perturbation theory (LPT) to
connect Gaussian initial conditions to the final density field, followed by a
coordinate transformation to obtain the redshift space representation for
comparison with data. We implement a sophisticated Hamiltonian Monte Carlo
sampler to generate realizations of 3D primordial and present-day matter
fluctuations from a non-Gaussian LPT-Poissonian density posterior given a set of
observations. Our augmented framework with cosmological applications is
designated as ALTAIR (ALcock-Paczyński consTrAIned Reconstruction).</p>
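<p>To give a flavour of how a Hamiltonian Monte Carlo sampler works, here is a minimal Python sketch. It is not the BORG or ALTAIR sampler: the target is a toy two-dimensional Gaussian rather than the LPT-Poissonian density posterior, the mass matrix is the identity, and the function name <code>hmc_sample</code>, the step size and the number of leapfrog steps are all assumptions made for illustration.</p>

```python
import numpy as np


def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step=0.2, n_leapfrog=20, seed=0):
    """Draw samples with a basic HMC scheme (identity mass matrix)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        # Draw a fresh momentum and start a trajectory from the current state.
        p = rng.standard_normal(x.size)
        x_new, p_new = x.copy(), p.copy()
        # Leapfrog integration: half kick, alternating drifts and kicks,
        # then a final half kick.
        p_new += 0.5 * step * grad_log_prob(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step * p_new
            p_new += step * grad_log_prob(x_new)
        x_new += step * p_new
        p_new += 0.5 * step * grad_log_prob(x_new)
        # Metropolis accept/reject step based on the change in total energy.
        h_old = -log_prob(x) + 0.5 * p @ p
        h_new = -log_prob(x_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
        samples[i] = x
    return samples


# Toy target: a standard Gaussian in two dimensions.
chain = hmc_sample(lambda x: -0.5 * x @ x, lambda x: -x, np.zeros(2), 2000)
```

<p>The appeal of this family of samplers for density field inference is that the gradient of the posterior guides each proposal, so that acceptance rates can stay high even when the number of parameters is very large.</p>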
<p>The essence of this AP test can be summarized as follows: The Bayesian inference
machinery explores the various cosmological expansion histories and selects the
cosmology-dependent evolution pathways which yield isotropic correlations of the
galaxy density field in comoving coordinates, thereby constraining cosmology. In particular, we sample
the present-day values of the matter density and dark energy equation-of-state parameters,
i.e. <script type="math/tex">\Omega_{\mathrm{m}}</script> and <script type="math/tex">w_0</script>, respectively. The reconstruction
scheme employed in ALTAIR is depicted in Figure 2.</p>
<p class="figure wide"><img src="/assets/posts/altair/reconstruction_schematic.jpg" alt="Schematic of the reconstruction pipeline" />
<em>This schematic illustrates the reconstruction pipeline of ALTAIR. The forward
model consists of a chain of various components for the non-linear evolution
from initial conditions and the subsequent transformation from comoving to
redshift space for the application of the AP test. This consequently transforms
the initial density field into a set of predicted observables, i.e. a galaxy
distribution in redshift space, for comparison with data via a likelihood or
posterior analysis.</em></p>
<h1 id="key-results">Key results</h1>
<p>We have showcased the performance of ALTAIR on a mock galaxy catalogue that
emulates the features of the SDSS-III survey. The main aspects of our
investigation are summarized below.</p>
<h2 id="tight-cosmological-constraints">Tight cosmological constraints</h2>
<p>The marginal and joint posterior distributions for the cosmological parameters
are displayed in Figure 3, demonstrating the capability of ALTAIR to infer tight
constraints. Our AP test fully exploits the high information content from the
cosmic expansion as a result of probing a deep redshift range, where the
distortion is more pronounced.</p>
<p class="figure wide"><img src="/assets/posts/altair/seaborn_subplot_posteriors.jpg" alt="Cosmological constraints" />
<em>The marginal and joint posteriors for <script type="math/tex">\Omega_{\mathrm{m}}</script> and <script type="math/tex">w_0</script>
illustrate the potential of ALTAIR to yield tight cosmological constraints from
present and next-generation galaxy redshift surveys.</em></p>
<p>With baryon acoustic oscillations (BAOs) being a robust standard ruler, the AP
test has been utilized for the simultaneous measurement of the Hubble parameter
and angular diameter distance of distant galaxies. Therefore, as a comparison,
we depict the corresponding constraints obtained via BAO measurements from the
SDSS-III (Data Release 12) in Figure 4. These BAO constraints have not been
combined with Planck measurements, which would significantly tighten the
constraints. Nevertheless, this highlights the significant potential
constraining power of our AP test, compared to standard BAO analyses, while
being at least as robust.</p>
<p class="figure"><img src="/assets/posts/altair/error_ellipses_BAO_altair_inset.jpg" alt="Comparison of cosmological constraints from BAO measurements and our implementation of AP test" />
<em>Comparison of cosmological constraints from BAO measurements (SDSS-III, DR12)
and our implementation of the AP test in ALTAIR. The ellipses denote their
respective 1-sigma confidence regions, centered on the fiducial cosmological
parameters. Note that the BAO constraints have not been combined with Planck
CMB measurements. This demonstrates the potential constraining power of our AP
test compared to standard BAO analyses, with the inset focusing on the ALTAIR
constraints where the fiducial cosmology is depicted in dashed lines.</em></p>
<h2 id="robustness-to-a-misspecified-model">Robustness to a misspecified model</h2>
<p>The main strength of our implementation of the AP test lies in its robustness to
a misspecified model and its inherent approximations, thereby near-optimally
exploiting the model predictions, without relying on its accuracy in modelling
the scale dependence of the correlations of the field.</p>
<p>We demonstrated this robustness of our AP test by employing a modified prior
power spectrum in the inference procedure. By adopting a different cosmology
(<script type="math/tex">\Omega_{\mathrm{m}} = 0.40</script> and <script type="math/tex">w_0 = -0.85</script>), we modify the shape of the
power spectrum, and subsequently apply ALTAIR on the same mock catalogue. As
shown in Figure 5, we recover the fiducial cosmological parameters employed in
the mock generation, although with uncertainties roughly 15% larger than for the
original run. This test case therefore explicitly highlights the
robustness of our implementation of the AP test to a misspecified model, since it
draws its information not from the scale dependence of the correlations
of the density field, but from the isotropy of the field.</p>
<p class="figure wide"><img src="/assets/posts/altair/seaborn_subplot_posteriors_diff_Pk.jpg" alt="Cosmological constraints with modified prior" />
<em>Same as Figure 3, but employing a different prior power spectrum
(<script type="math/tex">\Omega_{\mathrm{m}} = 0.40</script> and <script type="math/tex">w_0 = -0.85</script>). By recovering the
fiducial cosmological parameters employed in the mock generation, this test
case explicitly highlights the robustness of our approach to the shape of the
prior power spectrum adopted. The corresponding uncertainties are around 15%
larger than for the original run.</em></p>
<h2 id="extremely-weak-dependence-on-galaxy-bias">Extremely weak dependence on galaxy bias</h2>
<p>The robustness of our method to model misspecification yields another key
aspect, which is that the cosmological constraints show extremely weak
dependence on the currently unresolved phenomenon of galaxy bias. This yields
two crucial advantages:</p>
<ul>
<li>
<p>This is especially interesting as the lack of a sufficient description of this
bias remains a potential limiting factor for standard approaches.</p>
</li>
<li>
<p>This also implies that our method does not depend on the absolute density
fluctuation amplitudes. This is therefore among the first methods to extract a
large amount of information from statistics other than that of direct density
contrast correlations, without relying on the power spectrum or bispectrum,
thereby providing complementary information to state-of-the-art techniques.</p>
</li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:K">
<p>D. Kodi Ramanah, G. Lavaux, J. Jasche & B. D. Wandelt, 2018, submitted to A&A, <a href="https://arxiv.org/pdf/1808.07496">arxiv 1808.07496</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:K" class="reversefootnote">↩</a></p>
</li>
<li id="fn:T">
<p>Talk by Doogesh Kodi Ramanah <a href="/assets/talks/DKR_Oxford_JC2018.pdf">(slides)</a> <a href="#fnref:T" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 27 Aug 2018 00:00:00 +0200
https://www.aquila-consortium.org/method/altair.html
https://www.aquila-consortium.org/method/altair.htmlmethodFifth force on galaxy cluster scale<h1 id="overview">Overview</h1>
<p>Although the current cosmological paradigm – <script type="math/tex">\Lambda</script>CDM – is remarkably successful at explaining a great range of observations, a number of puzzles suggest that it may need to be extended. Generic extensions introduce new fields alongside the metric tensor of General Relativity, which couple to matter and induce new interactions between objects. Called “fifth forces” because they supplement the four known fundamental forces of nature, these new interactions are the smoking guns of new physics.</p>
<p>Physicists have been searching for fifth forces in the Solar System and laboratory for several decades, placing ever tighter constraints on their strength and range. Recently, however, it has become clear that many classes of extensions to <script type="math/tex">\Lambda</script>CDM would not be expected to produce observable deviations from General Relativity in these regimes. This is due to a property of the field equations known as <em>screening</em>, which implies that the fifth force effectively decouples from matter in high density regions such as the interior of the Milky Way. To probe the fifth forces of screened theories we therefore need tests beyond the Milky Way, in the low density environments of the Universe at large.</p>
<h1 id="cosmic-cartography">Cosmic Cartography</h1>
<p>A first step to testing screening is to identify which regions of the local Universe would be expected to be screened or unscreened in specific theories, based on the regions’ densities or gravitational field strengths. To do this, we combined the BORG-PM algorithm<sup id="fnref:BORGPM"><a href="#fn:BORGPM" class="footnote">1</a></sup> with a model of small-scale structure to reconstruct three measures of the gravitational field – Newtonian potential, acceleration and spacetime curvature – out to redshift <script type="math/tex">\sim0.05</script> <sup id="fnref:D18a"><a href="#fn:D18a" class="footnote">2</a></sup>. Figure 1 shows a slice through the Newtonian potential field: blue regions are those of weak gravitational field, which are most likely to harbour unscreened galaxies within which a fifth force is manifest. Newtonian potential is specifically relevant to the “chameleon” and “symmetron” screening mechanisms; acceleration and curvature govern the degree of screening under the “kinetic” and “Vainshtein” mechanisms respectively. Our maps – publicly available on Desmond’s <a href="https://www2.physics.ox.ac.uk/contacts/people/desmond">website</a> – provide each of these screening proxies at any point in space within <script type="math/tex">\sim 200 h^{-1}</script> Mpc.</p>
<p class="figure"><img src="/assets/posts/fifth_force/fig1.png" alt="Gravitational potential" />
<em>Contour plot of the gravitational potential across a 300 Mpc x 300 Mpc slice of the local universe (1 Mpc = 3.26 million light-years). The Milky Way is located at x=y=0. From Desmond et al 2018(a)<sup id="fnref:D18a:1"><a href="#fn:D18a" class="footnote">2</a></sup>.</em></p>
<h1 id="searching-for-new-forces">Searching for new forces</h1>
<p class="figure"><img src="/assets/posts/fifth_force/fig2a.png" alt="Conservative analysis" />
<em>A conservative analysis of the separation of stars and gas in galaxies in different gravitational environments produces precise constraints on the strength and range of a screened or unscreened fifth force. The region above the line is excluded. From Desmond et al 2018(b)<sup id="fnref:D18b"><a href="#fn:D18b" class="footnote">3</a></sup>.</em></p>
<p>Now knowing which galaxies ought to be screened and which not, we can search for observational differences between them. These differences arise because stars in otherwise unscreened galaxies are themselves dense, and therefore self-screen. Thus while gas and dark matter interact with surrounding mass via a fifth force, the stars do not, so that the various components of galaxies fall at different rates in an external field. In particular, the stellar disk lags behind the gas disk and dark matter halo in the direction of the external fifth force field. This has two observational consequences, which we have studied in detail:</p>
<ul>
<li>
<p>An offset between the centroids of optical (stellar) and HI (gas) emission<sup id="fnref:D18b:1"><a href="#fn:D18b" class="footnote">3</a></sup> <sup id="fnref:D18c"><a href="#fn:D18c" class="footnote">4</a></sup></p>
</li>
<li>
<p>A U-shaped warp in the stellar disk, bending away from the direction of the fifth force <sup id="fnref:D18d"><a href="#fn:D18d" class="footnote">5</a></sup></p>
</li>
</ul>
<p>In both cases we achieve sensitivity to fifth forces with strength ~1% that of gravity, for ranges <script type="math/tex">\sim0.5-50</script> Mpc. Assuming highly conservative observational uncertainties we place the strongest constraints to date on fifth-force properties at the scale of galaxies and their environments, as shown in Figure 2. Using a more realistic model for observational uncertainties, the analyses provide independent yet fully-compatible evidence for a screened fifth force of range <script type="math/tex">\lambda_C \simeq 2</script> Mpc and strength <script type="math/tex">\Delta G/G_N \simeq 0.02</script> (Figure 3). This is well below the detection threshold of any previous experiment. We caution however that baryonic physics may confound this inference; we will explore this in future work, alongside devising novel probes of other types of fundamental physics with our inference framework, such as dark matter self-interactions (Pardo et al 2018, in prep).</p>
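<p>To make the notions of strength and range more concrete, the short Python sketch below evaluates the ratio of fifth force to Newtonian force between two unscreened point masses. It assumes a Yukawa-type interaction, a common parameterisation that is adopted here for illustration only, and it ignores screening altogether. The function name and the default values, taken from the strength and range quoted above, are illustrative.</p>

```python
import numpy as np


def fifth_force_ratio(r, strength=0.02, force_range=2.0):
    """Fifth-to-Newtonian force ratio at separation r (same units as range).

    Assumes a Yukawa potential, for which the force ratio is
    strength * (1 + r / range) * exp(-r / range). Screening is not modelled.
    """
    x = np.asarray(r, dtype=float) / force_range
    return strength * (1.0 + x) * np.exp(-x)


# Ratio at a few separations in Mpc, for a range of 2 Mpc.
separations = np.array([0.1, 1.0, 2.0, 10.0, 20.0])
ratios = fifth_force_ratio(separations)
```

<p>Under this assumption the extra force acts at nearly full strength on scales well below the range and is exponentially suppressed beyond it, which is why the sensitivity of a given probe depends on the scales it covers.</p>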
<p class="figure"><img src="/assets/posts/fifth_force/fig3.png" alt="Less conservative analysis" />
<em>A less conservative analysis suggests the action of a screened fifth force operating on scales <script type="math/tex">\sim2</script> Mpc, shown here from the study of galactic warps. The plot shows the increase in goodness-of-fit of the model over General Relativity as a function of fifth-force range. The dashed lines show the results of analysing mock data with a fifth-force signal injected by hand. From Desmond et al 2018(d)<sup id="fnref:D18d:1"><a href="#fn:D18d" class="footnote">5</a></sup>.</em></p>
<div class="footnotes">
<ol>
<li id="fn:BORGPM">
<p>See <a href="/method/borgpm.html">BORG-PM post</a> and Jasche & Lavaux, 2018, submitted to A&A, <a href="https://arxiv.org/pdf/1806.11117">1806.11117</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" />. <a href="#fnref:BORGPM" class="reversefootnote">↩</a></p>
</li>
<li id="fn:D18a">
<p><a href="http://dx.doi.org/10.1093/mnras/stx3062">MNRAS 474, 3152-3161</a> <img class="inline-logo svg" src="/assets/images/newspaper-solid.svg" alt="journal" />, <a href="https://arxiv.org/abs/1705.02420">arXiv:1705.02420</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" />. <a href="#fnref:D18a" class="reversefootnote">↩</a> <a href="#fnref:D18a:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:D18b">
<p>MNRAS Letters submitted, <a href="https://arxiv.org/abs/1802.07206">arXiv:1802.07206</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" />. <a href="#fnref:D18b" class="reversefootnote">↩</a> <a href="#fnref:D18b:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:D18c">
<p>PRD submitted, <a href="https://arxiv.org/abs/1807.01482">arXiv:1807.01482</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" />. <a href="#fnref:D18c" class="reversefootnote">↩</a></p>
</li>
<li id="fn:D18d">
<p>PRD submitted, <a href="https://arxiv.org/abs/1807.11742">arXiv:1807.11742</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" />. <a href="#fnref:D18d" class="reversefootnote">↩</a> <a href="#fnref:D18d:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
</ol>
</div>
Thu, 16 Aug 2018 00:00:00 +0200
https://www.aquila-consortium.org/method/observations/fifth_force.html
https://www.aquila-consortium.org/method/observations/fifth_force.htmlmethodobservationsThe BORG Particle-Mesh model<h1 id="overview-of-the-problem">Overview of the problem</h1>
<p>Accurate analyses of present and next-generation cosmological galaxy surveys
require new ways to handle effects of non-linear gravitational structure
formation processes in data. To address these needs we present an extension of
our previously developed algorithm for Bayesian Origin Reconstruction from
Galaxies to analyse matter clustering at non-linear scales in observations. This
is achieved by incorporating a numerical particle mesh model of gravitational
structure formation into our Bayesian inference framework.</p>
<h1 id="a-new-technology">A new technology</h1>
<p>The algorithm simultaneously infers the three-dimensional primordial matter
fluctuations from which present non-linear observations formed and provides
reconstructions of velocity fields and structure formation histories. The
physical forward modeling approach automatically accounts for the non-Gaussian
features in gravitationally evolved matter density fields and addresses the
redshift space distortion problem associated with peculiar motions of observed
galaxies. Our algorithm employs a hierarchical Bayes approach to jointly account
for various observational effects, such as unknown galaxy biases, selection
effects, and observational noise. Corresponding parameters of the data model are
marginalized out via a sophisticated Markov Chain Monte Carlo approach relying
on a combination of a multiple block sampling framework and an efficient
implementation of a Hamiltonian Monte Carlo sampler. We demonstrate the
performance of the method by applying it to the 2M++ galaxy compilation, tracing
the matter distribution of the Nearby Universe. We show accurate and detailed
inferences of the three-dimensional non-linear dark matter distribution of the
Nearby Universe. As exemplified in the case of the Coma cluster, our method
provides complementary mass estimates that are compatible with those obtained
from weak lensing and X-ray observations. For the first time, we also present a
reconstruction of the vorticity of the non-linear velocity field from
observations. In summary, our method provides plausible and very detailed
inferences of the dark matter and velocity fields of our cosmic neighbourhood.</p>
<p class="figure wide"><img src="/assets/posts/borgpm/chrono_sg.jpg" alt="Chronocosmography of the Nearby Universe" />
<em>This picture illustrates the capability to infer one plausible history of the
formation of large-scale structures. The history reads from left to right, top
to bottom. The final snapshot shows the galaxies overlaid on the inferred
density field.</em></p>
<h1 id="applications-in-cosmology">Applications in cosmology</h1>
<p>Our method has applications in all fields in cosmology, either for direct
measurements of underlying physical parameters or for comparing and correlating
with other observations of the same part of the Universe. In this work, we have
focused on only three aspects: the measurement of masses of clusters and superclusters
of galaxies, the properties of the peculiar velocity field on large scales and
the study of claimed anomalies in the density fluctuations.</p>
<h2 id="cluster-mass-measurements">Cluster mass measurements</h2>
<p>The first direct application is the measurement of the mass of clusters of galaxies. We have defined this mass in the simplest possible fashion: the total mass enclosed within a radius <script type="math/tex">r</script> of the centre of the structure, or mathematically speaking:
<script type="math/tex">M(r) = \int_{|\mathbf{x}| \le r} \rho(\mathbf{x})~\mathrm{d}^3\mathbf{x}\,,</script>
where <script type="math/tex">\mathbf{x}</script> is measured from that centre. As a reference, we use the mass that would be enclosed if the content of the Universe were strictly homogeneously distributed, or mathematically:
<script type="math/tex">M_\mathrm{mean}(r) = \frac{4\pi}{3} \rho_\mathrm{mean} r^3\,.</script></p>
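<p>On a gridded density field, the enclosed mass can be estimated by summing the mass of every voxel whose centre lies within the chosen radius. The Python sketch below shows one way to do this and to compute the homogeneous reference mass. It assumes a cubic grid in a box of known side length; the function names and the grid layout are hypothetical and do not reflect the BORG-PM code.</p>

```python
import numpy as np


def enclosed_mass(density, box_size, centre, radii):
    """Mass within each radius of `centre` for a cubic density grid.

    `density` has shape (n, n, n) and is in mass per unit volume; voxel
    centres sit at (i + 0.5) * box_size / n along each axis.
    """
    n = density.shape[0]
    cell = box_size / n
    coords = (np.arange(n) + 0.5) * cell
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    dist = np.sqrt(
        (x - centre[0]) ** 2 + (y - centre[1]) ** 2 + (z - centre[2]) ** 2
    )
    voxel_mass = density * cell ** 3
    return np.array([voxel_mass[dist <= r].sum() for r in radii])


def mean_mass(rho_mean, radii):
    """Mass enclosed in a sphere for a strictly homogeneous universe."""
    return 4.0 * np.pi / 3.0 * rho_mean * np.asarray(radii, dtype=float) ** 3


# Sanity check on a uniform field: both estimates should nearly agree.
radii = [20.0, 40.0]
uniform = np.full((64, 64, 64), 2.0)
masses = enclosed_mass(uniform, 100.0, (50.0, 50.0, 50.0), radii)
reference = mean_mass(2.0, radii)
```

<p>The ratio of the two quantities indicates how overdense a structure is at each radius. The voxel sum is only accurate once the radius spans several grid cells.</p>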
<p>We showcase our estimator by focusing on one well-studied object: the Coma cluster.
The performance of our estimator is shown in the figure below. The measurement
obtained through BORG-PM inference (solid lines and filled regions) is clearly
compatible with the other probes considered in this study.</p>
<p class="figure"><img src="/assets/posts/borgpm/coma_mass.jpg" alt="Coma mass profile" />
<em>The above picture shows the mass profile, i.e. the mass enclosed within a given
distance of the object, derived through different methods and data for
the same cluster of galaxies: Coma. The BORG-PM result is given by the solid red
line (mean mass profile), and gray/dark gray filled regions for the 68% and 95%
limits. The other probes are given with their references and typical enclosed radius.</em></p>
<p>The advantage of our method is that this measurement can be freely reproduced for any structure within the observational boundaries. We simply isolate a structure in the volume and ask for its mass.</p>
<h2 id="peculiar-velocity-field">Peculiar velocity field</h2>
<p>The second direct result of the analysis is the derivation of the peculiar velocity field
for the covered volume. The peculiar velocity field is notoriously difficult to get
right. Among the reasons are:</p>
<ul>
<li>large-scale correlations, leading to high sensitivity to boundary effects</li>
<li>the requirement to have an unbiased total matter density field</li>
<li>systematic effects arising from the use of redshifts to derive the tracer positions
and their contribution to the mass density (the so-called Malmquist bias). The tracers
also have specific radial selection properties, yielding further systematic effects.</li>
</ul>
<p>Classic methods have mostly relied on linear perturbation theory of density fluctuations to
derive estimators of these fields. The BORG-PM method allows a self-consistent derivation of
these fields including non-linearities. This makes it possible, for the first time, to model
completely non-linear fields.</p>
<p class="figure"><img src="/assets/posts/borgpm/pecvel.jpg" alt="Peculiar velocity field" />
<em>Peculiar velocity field picture</em></p>
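<p>For comparison with the classic approach mentioned above, the following Python sketch derives a velocity field from a density contrast using linear perturbation theory, in which the velocity divergence is proportional to the density contrast. This is the approximation that BORG-PM goes beyond, not the BORG-PM model itself. The periodic cubic grid, the function name and the default values of the Hubble rate and growth rate are assumptions chosen for illustration.</p>

```python
import numpy as np


def linear_velocity(delta, box_size, hubble=100.0, growth_rate=0.5,
                    scale_factor=1.0):
    """Linear-theory peculiar velocity from a periodic density contrast.

    Solves div(v) = -a * H * f * delta in Fourier space, which gives
    v(k) = i * a * H * f * delta(k) * k / |k|^2. Returns an array of shape
    (3, n, n, n), in km/s if `hubble` is in km/s/Mpc and lengths in Mpc.
    """
    n = delta.shape[0]
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    k2 = kx ** 2 + ky ** 2 + kz ** 2
    k2[0, 0, 0] = 1.0  # placeholder to avoid dividing by zero at k = 0
    delta_k = np.fft.fftn(delta)
    common = 1j * scale_factor * hubble * growth_rate * delta_k / k2
    common[0, 0, 0] = 0.0  # the mean mode sources no flow
    return np.stack([np.fft.ifftn(common * k).real for k in (kx, ky, kz)])


# Example: a single plane wave along the x axis.
n, box = 16, 100.0
x = np.arange(n) * box / n
k0 = 2.0 * np.pi * 2.0 / box
wave = 0.1 * np.cos(k0 * x)[:, None, None] * np.ones((n, n, n))
velocity = linear_velocity(wave, box)
```

<p>Such a field is irrotational by construction, so quantities like the vorticity of the velocity field are out of reach of this approximation and require a non-linear model.</p>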
<h1 id="the-future">The Future</h1>
<p>Some other applications are showcased in the paper (e.g. density anomalies, velocity field vorticity). We have only scratched the surface of the possibilities opened by this kind of inference. We invite the interested reader to have a closer look at the article and see recent related work, notably on the <a href="/method/observations/fifth_force.html">fifth force</a> and <a href="/method/altair.html">Alcock-Paczyński</a> effects.</p>
<h1 id="references">References</h1>
<ul>
<li>J. Jasche & G. Lavaux, 2019, A&A, 625, A64, <a href="https://arxiv.org/pdf/1806.11117">arxiv 1806.11117</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /></li>
</ul>
Tue, 24 Jul 2018 00:00:00 +0200
https://www.aquila-consortium.org/method/borgpm.html
https://www.aquila-consortium.org/method/borgpm.htmlmethod