I probably didn’t need to do this.

A Diversion

I feel like I wouldn’t be doing my due diligence if I didn’t at least pause and try to fit a deep neural network [2] to this hypothetical problem. Again, I don’t need to do this, but I can imagine there would be calls to look at this problem strictly through a deep learning lens, since ChatGPT itself (and generative AI in general) is a type of deep learning [3]. I am by no means an expert on deep learning and NLP, but there are so many tutorials online [4] [5] [6] that I’m going to run through a quick and dirty example of how to train a neural net and make predictions. All associated code can be found in my GitHub [7].

Show Me The Code

The text cleaning is the same as in Part 2 to keep things consistent, but this time, instead of fitting a simple Logistic Regression, we’re going to fit a relatively shallow neural network to our data and see what kind of accuracy we can achieve.
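
Before the model itself, a quick note on what it consumes: a TF-IDF matrix built from the cleaned text. Here is a minimal sketch of how such a dataset might be wired up (the ChatGPTDataset class, the df column names, and the internals are illustrative; the actual version lives in the repo [7]):

import torch
from torch.utils.data import Dataset
from sklearn.feature_extraction.text import TfidfVectorizer

class ChatGPTDataset(Dataset):

    def __init__(self, texts, labels):
        vectorizer = TfidfVectorizer()
        # Densify the sparse TF-IDF matrix so nn.Linear can consume rows directly
        self.x_tfidf = torch.tensor(vectorizer.fit_transform(texts).toarray())
        # Integer class labels, as expected by nn.CrossEntropyLoss
        self.y = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x_tfidf[idx], self.y[idx]

chatgpt_dataset = ChatGPTDataset(df["clean_text"], df["is_chatgpt"].to_numpy())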

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseNetwork(nn.Module):

    def __init__(self):
        super(DenseNetwork, self).__init__()
        # The input width matches the number of TF-IDF features ("columns")
        self.fc1 = nn.Linear(chatgpt_dataset.x_tfidf.shape[1], 1024)
        self.drop1 = nn.Dropout(0.8)
        self.fc2 = nn.Linear(1024, 256)
        self.drop2 = nn.Dropout(0.6)
        self.fc3 = nn.Linear(256, 128)
        self.drop3 = nn.Dropout(0.4)
        self.prediction = nn.Linear(128, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x.to(torch.float)))
        x = self.drop1(x)
        x = F.relu(self.fc2(x))
        x = self.drop2(x)
        x = F.relu(self.fc3(x))
        x = self.drop3(x)
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax internally
        return self.prediction(x)

The above is a really pared-down example of a DenseNet [8], where all the layers are fully connected and dropout is inserted between them as a regularization method to prevent overfitting [9]. An important point to note is that the first nn.Linear dimension must match the width of the TF-IDF matrix we generated (the number of “columns”, i.e., the size of the vocabulary). Also note that the network returns raw logits rather than log-probabilities, since nn.CrossEntropyLoss applies log-softmax internally. We instantiate this model, add a loss function [10], and slap the Adam [11] optimizer on it.

import torch.optim as optim

model = DenseNetwork()
criterion = nn.CrossEntropyLoss()  # expects raw logits, hence no log_softmax in forward()
optimizer = optim.Adam(model.parameters())
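
The training loop below also assumes a pair of DataLoaders. A minimal sketch of how they might be built (the 80/20 split and batch size are my assumptions; the real setup is in the repo [7]):

import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

# Shuffle the row indices and carve out an 80/20 train/validation split
indices = np.random.permutation(len(chatgpt_dataset))
split = int(0.8 * len(chatgpt_dataset))

train_sampler = SubsetRandomSampler(indices[:split])
test_sampler = SubsetRandomSampler(indices[split:])

train_loader = DataLoader(chatgpt_dataset, batch_size=64, sampler=train_sampler)
validation_loader = DataLoader(chatgpt_dataset, batch_size=64, sampler=test_sampler)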

From here, we can easily train the model:

epochs = 7
losses = []
accuracies = []

model.train()  # ensure dropout is active during training
for epoch in range(1, epochs + 1):
    epoch_loss = 0.0
    epoch_true = 0
    epoch_total = 0
    for data, target in train_loader:
        # Standard PyTorch update: zero the gradients, forward pass,
        # compute the loss, backpropagate, and step the optimizer
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

        # The predicted class is the index of the larger logit
        _, pred = torch.max(outputs, dim=1)
        epoch_true += torch.sum(pred == target).item()
        epoch_total += target.size(0)

    losses.append(epoch_loss)
    accuracies.append(100 * (epoch_true / epoch_total))

    print(f"Epoch {epoch}/{epochs} finished: "
          f"train_loss = {epoch_loss:.4f}, train_accuracy = {accuracies[-1]:.2f}%")

And then evaluate it on the held-out validation set:

test_true = 0
test_total = len(test_sampler)
test_loss = 0.0

model.eval()  # important: disables the dropout layers for evaluation
with torch.no_grad():
    for data, target in validation_loader:
        outputs = model(data)
        test_loss += criterion(outputs, target).item()

        _, pred = torch.max(outputs, dim=1)
        test_true += torch.sum(pred == target).item()

print(f"Validation finished: Accuracy = {round(100 * (test_true / test_total), 2)}%, Loss = {test_loss:.4f}")

Comparison is the Thief of Joy

With the very simple logistic regression, we reported an accuracy of approximately 79.6%, which is really not bad given how coarse we’ve been with data pre-processing and training. The DenseNet implementation yields an accuracy of about 80.4%. So, our DenseNet is 0.8 percentage points better than our logistic regression (meaning if we were to feed 1,000 ChatGPT prompts to our prediction models, we would expect the DenseNet to accurately identify 8 more sentences than our logistic regression). Is that better? Quantitatively, sure; qualitatively, who knows. Obviously our dataset is not very large, and maybe we’d see better performance from our DenseNet if we gathered more data, but neural nets in general have a number of problems [12]. They are also not the easiest to deploy into production [13]. Given the flexibility of the logistic regression and its relative speed, I’d probably prefer to deploy that into production rather than the DenseNet.

One More Thing

A final thought before launching into the full Bayesian experience. I alluded to this in Part 2, but ideally we would like our classifier to return not a single point prediction but a probability distribution. Instead of saying that our regression or DenseNet predicts a value of 0.7 (i.e. class 1 when rounded), we would like to say that our model, given this set of parameters, produces a probability density with its mass centered around 0.7 and 95% credible intervals stretching from 0.5 to 0.9. I think this is a much better way to translate the inherent stochasticity of the real world into a model. This requires more thought on the modeler’s part, since we now have to decide how to report back a prediction (do we round any probability density with a mean > 0.5 to class 1? do we only accept predictions whose 95% credible intervals don’t cross 0.5? etc.), but it means that we can report back a prediction with measures of uncertainty. Which, going back to the original point of this series of posts, is the ultimate goal: can we produce a model that not only accurately predicts whether a text was generated by ChatGPT, but also reports back the uncertainty of that prediction?
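
To make that concrete, here is a rough sketch of one cheap approximation, Monte Carlo dropout: leave the dropout layers active at inference time and sample repeated stochastic forward passes, which yields a crude distribution over the predicted class-1 probability instead of a single number. This is not the full Bayesian treatment this series is building toward, and batch_of_tfidf_rows below is just a placeholder, but it illustrates the kind of interval-based decision rule described above:

def mc_dropout_predict(model, x, n_samples=100):
    model.train()  # deliberately keep dropout ON at inference time
    with torch.no_grad():
        # Each pass drops a different random set of units, so the
        # predicted probabilities vary from sample to sample
        probs = torch.stack([
            torch.softmax(model(x), dim=1)[:, 1]  # P(class == 1)
            for _ in range(n_samples)
        ])
    model.eval()
    # Summarize the samples: mean plus a rough 95% interval per input
    return probs.mean(dim=0), probs.quantile(0.025, dim=0), probs.quantile(0.975, dim=0)

mean_p, lower_p, upper_p = mc_dropout_predict(model, batch_of_tfidf_rows)
# e.g., only call something class 1 when the whole interval clears 0.5
confident_positive = lower_p > 0.5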

Further Reading