A Bayesian nightmare

Okay …

Remember, the whole goal of this approach is to let some measure of uncertainty in prediction make its way into our model. This isn’t easy, as most machine learning models nowadays are inherently frequentist, and, furthermore, adding Bayesian ideas to a neural network is quite a complex problem. Thankfully, there are libraries working in this space, one of which is Pyro [2].

Priors and Others

Before we go into the model and its interpretation, let’s talk about why people take issue with Bayes’ theorem, along with some of the computational problems it raises. Just as a refresher, here’s Bayes’ theorem:

$$ p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} $$

Essentially:

$$ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal}} $$

For our purposes, $\theta$ is the set of network weights and biases, and $X$ is the observed training data.
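Before we get to the network, a quick toy calculation (with completely made-up numbers) shows how the pieces of the theorem interact. Suppose we think a coin has a 10% chance of being biased towards heads, and we then see five heads in a row:

# Toy Bayes update with made-up numbers -- purely illustrative.
prior = 0.1                                   # p(biased) before seeing any flips
likelihood = 0.9 ** 5                         # p(5 heads | biased), if a biased coin lands heads 90% of the time
marginal = 0.9 ** 5 * 0.1 + 0.5 ** 5 * 0.9    # p(5 heads) averaged over the biased and fair hypotheses

posterior = likelihood * prior / marginal
print(round(posterior, 2))                    # 0.68 -- the data pulls us towards "biased", the prior tempers it

The same structure carries over to the network: the prior over the weights gets reweighted by how well those weights explain the data.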

One of the biggest issues people have with the theorem is the use of a prior. The first time I used Bayes’ theorem in earnest, the idea of letting a prior belief influence the model felt weird. It feels antithetical to “pure” data science, since you are biasing the results of the modeling process, and many people share this opinion [3]. It also presents a practical issue: how do we choose a prior? Stan [4], the Bayesian backbone, has a whole article on how to choose one [5]. I still don’t think there is a one-size-fits-all way of defining a prior, because it depends so much on your data, your beliefs, previous research, and how your model behaves under posterior checks. (Or you can use empirical Bayes [6] and sidestep setting priors entirely.) An example of what two different choices look like in code follows below. But enough of this; far smarter people have written far more about it, so let’s just code.
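To make the “how do we choose a prior” question a bit more concrete, here are two common choices for a weight prior, sketched with Pyro’s distributions. The shapes and scales are illustrative, not recommendations:

import torch
import pyro.distributions as dist

# Illustrative shapes only -- imagine a layer with 128 weights.
# A weakly informative prior: gently pulls weights towards zero.
weakly_informative = dist.Normal(loc=torch.zeros(128), scale=torch.ones(128))

# A much vaguer prior: closer to "let the data decide", at the cost of a
# posterior that is harder to approximate.
vague = dist.Normal(loc=torch.zeros(128), scale=10.0 * torch.ones(128))

The modeller function further down uses the first kind: a standard normal over every weight and bias.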

Party Time

All of the code can be found in my GitHub repo, and I won’t show all of it here because it is quite verbose. But below are the most important excerpts:

import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseNetBayes(nn.Module):
    def __init__(self, input_shape):
        super().__init__()
        # A plain fully connected network; the Bayesian part comes later,
        # when we place distributions over these weights.
        self.fc1 = nn.Linear(input_shape, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 128)
        self.pred = nn.Linear(128, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x.to(torch.float)))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        # Log-probabilities over the two output classes.
        x = F.log_softmax(self.pred(x).squeeze(), dim=-1)

        return x
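As a quick sanity check on shapes, the network can be run on a dummy batch. The 30 input features below are an arbitrary choice, purely for illustration:

# Illustrative only: 30 features and a batch of 16 are arbitrary choices.
model = DenseNetBayes(input_shape=30)
dummy_batch = torch.randn(16, 30)
log_probs = model(dummy_batch)
print(log_probs.shape)   # torch.Size([16, 2]) -- log-probabilities for the two classes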

And here is the function that places priors over those weights:

import pyro


def modeller(x, y):
    # Standard normal priors over the first layer's weights and biases,
    # matching the shapes of the underlying nn.Linear parameters.
    fc1w_prior = pyro.distributions.Normal(
        loc=torch.zeros_like(model.fc1.weight),
        scale=torch.ones_like(model.fc1.weight)
    )
    fc1b_prior = pyro.distributions.Normal(
        loc=torch.zeros_like(model.fc1.bias),
        scale=torch.ones_like(model.fc1.bias)
    )
    ...
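The elided part (see the full code in the repo) typically follows the classic Pyro pattern: build the same priors for the remaining layers, collect them into a dictionary keyed by parameter name, lift the network into a “random module”, and condition on the observed labels. A rough sketch of that continuation, not the exact code from the repo:

    # ...same kind of priors for fc2, fc3 and the prediction layer...
    priors = {'fc1.weight': fc1w_prior, 'fc1.bias': fc1b_prior}  # plus the other layers
    # Lift the deterministic network into a module whose weights are random variables.
    lifted_module = pyro.random_module("module", model, priors)
    sampled_model = lifted_module()   # draw one concrete set of weights from the priors
    lhat = sampled_model(x)           # log-probabilities from that sampled network
    # Condition on the observed labels.
    pyro.sample("obs", pyro.distributions.Categorical(logits=lhat), obs=y)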

Pyro has much deeper documentation, but the idea here is that we take the same dense neural network and place distributions over its parameters via the modeller function. This matters because, as we sample and update the parameters, we want to update our prior distributions on them to better fit the data. And here we run into another issue with the Bayesian approach: sampling.
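To foreshadow where that goes: in Pyro, the updating is usually done with stochastic variational inference, where a “guide” distribution approximates the posterior over the weights and an optimizer fits it to the data. A minimal sketch, assuming an automatic mean-field guide and dummy data (the repo may well define its own guide and use the real dataset):

import torch
import pyro
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam

# Assumption for illustration: a diagonal-normal guide over every latent weight in modeller.
guide = AutoDiagonalNormal(modeller)
svi = SVI(modeller, guide, Adam({"lr": 1e-3}), loss=Trace_ELBO())

# Dummy batch just to show the call; 30 features matches the toy model above.
x_batch = torch.randn(16, 30)
y_batch = torch.randint(0, 2, (16,))

for step in range(1000):
    loss = svi.step(x_batch, y_batch)   # one noisy gradient step on the ELBO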