- Goal: obtain a sample from (or the mode of) the true posterior $$ p(y, z \mid x) \propto p(y \mid x, z)\, p(z) $$
- We define some joint model \( p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta) \)
- We obtain observations \( \mathcal{D} = \{ (x_1, y_1), \ldots, (x_N, y_N) \} \)
- We would like to infer possible values of \(\theta\) given the observed data \(\mathcal{D}\): $$ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta} $$
- We will be approximating the true posterior distribution with a tractable approximate one
- We need a notion of distance between distributions to measure how good the approximation is: $$ \text{KL}(q(x) \,\|\, p(x)) = \mathbb{E}_{q(x)} \log \frac{q(x)}{p(x)} \quad\quad \textbf{Kullback-Leibler divergence} $$
- Not an actual distance, but \( \text{KL}(q(x) \,\|\, p(x)) = 0 \) iff \( q(x) = p(x) \) for all \(x\), and it is strictly positive otherwise
- We will be minimizing \( \text{KL}(q(\theta) \,\|\, p(\theta \mid \mathcal{D})) \) over \(q\)
- We'll take \(q(\theta)\) from some tractable parametric family, for example a Gaussian: $$ q(\theta \mid \Lambda) = \mathcal{N}(\theta \mid \mu(\Lambda), \Sigma(\Lambda)) $$
- Then we reformulate the objective so that it does not require the intractable posterior: minimizing \( \text{KL}(q(\theta \mid \Lambda) \,\|\, p(\theta \mid \mathcal{D})) \) is equivalent to maximizing the evidence lower bound (ELBO) $$ \mathbb{E}_{q(\theta \mid \Lambda)} \sum_{n=1}^N \log p(y_n \mid x_n, \theta) - \text{KL}(q(\theta \mid \Lambda) \,\|\, p(\theta)) \to \max_{\Lambda} $$ (a short code sketch follows this list)
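To make the reformulated objective concrete, here is a minimal sketch of stochastic ELBO maximization with a factorized Gaussian \( q(\theta \mid \Lambda) \). It is an illustration only, not part of the slides: the toy linear-regression likelihood, the noise scale 0.1, and all variable names are assumptions made for the example.

```python
import torch

# Toy data: y = 2*x + noise (purely illustrative)
torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 0.1 * torch.randn(50)

# Variational parameters Lambda = (mu, log_sigma) of q(theta | Lambda) = N(mu, sigma^2)
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

prior = torch.distributions.Normal(0.0, 1.0)              # p(theta)
for step in range(2000):
    opt.zero_grad()
    q = torch.distributions.Normal(mu, log_sigma.exp())   # q(theta | Lambda)
    theta = q.rsample()                                    # reparameterized sample
    log_lik = torch.distributions.Normal(theta * x, 0.1).log_prob(y).sum()
    kl = torch.distributions.kl_divergence(q, prior).sum()
    loss = -(log_lik - kl)                                 # negative one-sample ELBO estimate
    loss.backward()
    opt.step()
```

After training, `mu` should concentrate near the slope of the toy data, while `log_sigma.exp()` reports the remaining posterior uncertainty.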
- Often we can't compute this posterior exactly, so we use approximate posteriors
- Probability theory is a great tool to reason about uncertainty
- Bayesians quantify subjective uncertainty
- Frequentists quantify inherent randomness in the long run
- People seem to interpret probability as beliefs, and hence are Bayesians
- We want to make predictions about some \( x \)
- We formulate our prior beliefs about how the \( x \) might be generated
- We collect some data of already generated \( x \): $$ \mathcal{D}_\text{train} = (x_1, \ldots, x_N) $$
- We update our beliefs regarding what kind of data exist by incorporating the collected data
- We now can make predictions about unseen data
- And collect some more data to improve our beliefs
- We'll assume random variables have, and are described by, their probability density functions:
- \( p(X = x) \) (\( p(x) \) for short) – its probability density function
- \( \text{Pr}[X \in A] = \int_{A} p(X = x)\, dx \) – its distribution
- In general, several random variables \( X_1, \ldots, X_N \) have a joint probability density function \( p(x_1, \ldots, x_N) \)
- It describes the joint probability $$ \text{Pr}(X_1 \in A_1, \ldots, X_N \in A_N) = \int_{A_1} \cdots \int_{A_N} p(x_1, \ldots, x_N)\, dx_N \cdots dx_1 $$
- If (and only if) the random variables are independent, the joint density is just a product of the individual densities
- Vector random variables are just a bunch of scalar random variables
- For 2 and more random variables you should be considering their joint distribution
- \( \mathbb{E}_{p(x)} X = \int x\, p(x)\, dx \) – the expectation (mean)
- \( \mathbb{E} [\alpha X + \beta Y] = \alpha \mathbb{E} X + \beta \mathbb{E} Y \)
- \( \mathbb{V} X = \mathbb{E} [X^2] - (\mathbb{E} X)^2 = \mathbb{E}(X - \mathbb{E} X)^2 \) – the variance
- \(X\) is said to be Uniformly distributed over \((a, b)\) (denoted \(X \sim U(a, b)\)) if its probability density function is $$ p(x) = \begin{cases} \tfrac{1}{b-a}, & a < x < b \\ 0, &\text{otherwise} \end{cases} \quad\quad \mathbb{E} X = \frac{a+b}{2} \quad\quad \mathbb{V} X = \frac{(b-a)^2}{12} $$
- \(X\) is called a Multivariate Gaussian (Normal) random vector with mean \(\mu \in \mathbb{R}^n\) and positive-definite covariance matrix \(\Sigma \in \mathbb{R}^{n \times n}\) (denoted \(X \sim \mathcal{N}(\mu, \Sigma)\)) if its joint probability density function is $$ p(x) = \frac{1}{\sqrt{\det(2 \pi \Sigma)}} \exp \left( -\tfrac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) $$
- \(X\) is said to be Categorically distributed with probabilities \(\pi_1, \ldots, \pi_K\) if its probability mass function is $$ p(X = k) = \pi_k \Leftrightarrow p(x) = \prod_{k=1}^K \pi_k^{[x = k]} $$
- \(X\) is called a Bernoulli random variable with probability (of success) \(\pi \in [0, 1]\) (denoted \(X \sim \text{Bern}(\pi)\)) if its probability mass function is $$ p(X = 1) = \pi \Leftrightarrow p(x) = \pi^{x} (1-\pi)^{1-x} $$ (yes, this is a special case of the categorical distribution)
- The joint density on \(x\) and \(y\) defines the marginal densities, e.g. \( p(x) = \int p(x, y)\, dy \)
- Knowing the value of \(y\) can reduce uncertainty about \(x\), expressed via the conditional density \( p(x \mid y) = \frac{p(x, y)}{p(y)} \)
- Thus $$ p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y) $$
- Suppose we're having two jointly Gaussian random variables \(X\) and \(Y\): $$ (X, Y) \sim \mathcal{N}\left(\left[\begin{array}{c}\mu_x \\ \mu_y \end{array} \right], \left[\begin{array}{cc}\sigma^2_x & \rho_{xy} \\ \rho_{xy} & \sigma^2_y\end{array}\right]\right) $$
- Then one can show that the marginals and conditionals are also Gaussian: $$ p(x) = \mathcal{N}(x \mid \mu_x, \sigma^2_x) \quad\quad p(y) = \mathcal{N}(y \mid \mu_y, \sigma^2_y) \quad\quad p(x \mid y) = \mathcal{N}\left(x \,\middle|\, \mu_x + \tfrac{\rho_{xy}}{\sigma_y^2} (y - \mu_y),\; \sigma^2_x - \tfrac{\rho_{xy}^2}{\sigma_y^2}\right) $$
- If we're interested in \(y\), then these distributions are called the prior \(p(y)\), the likelihood \(p(x \mid y)\), the evidence \(p(x)\), and the posterior \(p(y \mid x)\)
- We assume some data-generating model $$ p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta) $$
- We obtain some observations \( \mathcal{D} = \{(x_n, y_n)\}_{n=1}^N \)
- We seek to make predictions regarding \(y\) for previously unseen \(x\), having observed the training set \(\mathcal{D}\)
- Take the approximate posterior \(q(\theta_i \mid \Lambda_i)\) such that with some fixed probability \(p\) the weight \(\theta_i\) is 0, and with probability \(1-p\) it's some learnable value \(\Lambda_i\)
- Then for some prior \(p(\theta)\) our optimization objective is $$ \mathbb{E}_{q(\theta \mid \Lambda)} \sum_{n=1}^N \log p(y_n \mid x_n, \theta) \to \max_{\Lambda} $$ where the KL term is missing due to the model choice
- No need to take special care about differentiating through samples
- Turns out, these are Bayesian approximate inference procedures (see the sketch after this list)
- Reference: Variational Dropout Sparsifies Deep Neural Networks, D. Molchanov, A. Ashukha, D. Vetrov, ICML 2017
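The point that ordinary dropout training maximizes exactly this bound can be seen in a few lines. Below is a minimal sketch, not the paper's implementation: the single linear layer, the Gaussian noise scale, and all names (`Lambda`, `mask`, the toy data) are assumptions made for the illustration.

```python
import torch

# Dropout-style approximate posterior over the weights of one linear layer:
# theta_i = 0 with probability p, theta_i = Lambda_i with probability 1 - p.
torch.manual_seed(0)
X = torch.randn(128, 10)                                      # toy inputs
y = X @ torch.randn(10, 1) + 0.05 * torch.randn(128, 1)       # toy targets

p = 0.5
Lambda = torch.nn.Parameter(0.1 * torch.randn(10, 1))         # learnable values Lambda_i
opt = torch.optim.Adam([Lambda], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    mask = torch.bernoulli(torch.full_like(Lambda, 1.0 - p))  # one sample theta ~ q(theta | Lambda)
    theta = mask * Lambda                                     # dropped weights are 0, the rest equal Lambda_i
    log_lik = torch.distributions.Normal(X @ theta, 0.1).log_prob(y).sum()
    (-log_lik).backward()   # maximize a one-sample estimate of E_q sum_n log p(y_n | x_n, theta)
    opt.step()
```

Sampling the mask and multiplying it by `Lambda` is ordinary weight-wise dropout; the gradient flows through `Lambda` with no special treatment of the sampling step, which is the "no need to differentiate through samples" point above.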
- Law of the Unconscious Statistician: $$ \mathbb{E} f(X) = \int f(x)\, p(x)\, dx $$
- If \(X\) and \(Y\) are independent, \( \mathbb{V}[\alpha X + \beta Y] = \alpha^2 \mathbb{V} X + \beta^2 \mathbb{V} Y \)
- \( \text{Cov}(X, Y) = \mathbb{E} [X Y] - \mathbb{E} X\, \mathbb{E} Y \) – the covariance of \(X\) and \(Y\)
- In general, \( \mathbb{V} [\alpha X + \beta Y] = \alpha^2 \mathbb{V}[X] + \beta^2 \mathbb{V}[Y] + 2 \alpha \beta\, \text{Cov}(X, Y) \) (checked numerically in the sketch below)
- $$ p(x_N = x \mid x_{
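To close, a quick Monte Carlo sanity check of the expectation and variance identities listed above. This is only a sketch; the use of NumPy, the particular distributions, and the sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(1.0, 2.0, size=n)        # X ~ N(1, 4)
y = rng.uniform(0.0, 3.0, size=n)       # Y ~ U(0, 3), independent of X
a, b = 0.5, -2.0

# Law of the Unconscious Statistician: estimate E[f(X)] from samples of X
print(np.mean(np.exp(x)), "approx E[exp(X)] = exp(mu + sigma^2/2) =", np.exp(1.0 + 4.0 / 2))

# Variance of a linear combination of independent variables
print(np.var(a * x + b * y), "approx", a**2 * np.var(x) + b**2 * np.var(y))

# Covariance identity Cov(X, Y) = E[XY] - E[X] E[Y]  (near 0 here, since X and Y are independent)
print(np.mean(x * y) - np.mean(x) * np.mean(y))
```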