Simple starts are still starts.

The Basics are Still Good

In Part 1, we posited that the Internet is dead. Now what? We’re not going to stop using the internet, since it has become so vital to our everyday lives, but what we can do is try to fight back. Since I’m a data person and not a policy person, the simplest thing for me to do is to try to train a model that identifies when text has been generated by ChatGPT. I think this is a good starting point because a number of analyses, both formal [2] and informal [3], have shown that while ChatGPT detection is improving, it’s still not quite there. Obviously, I’m not claiming I can do better with a small dataset and limited resources. But the problem interests me because the use of AI detectors on homework submissions has risen rapidly [4], and students complain vociferously on Reddit [5] that their work is getting flagged as AI generated.

A Brief Bayesian Detour

I alluded in Part 1 that Bayesian-ism is something I believe in, and that extends directly to a problem like this. So, if we take our output space to be a 0 or 1 event (a string of text was or wasn’t generated by ChatGPT), our link function [6] might be something like the logit [7], whose inverse is the familiar logistic (sigmoid) function:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
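To make the relationship explicit: the logit maps a probability onto the whole real line, and applying the sigmoid to the model’s linear score maps it back. This is standard logistic-regression notation, and the same hypothesis shows up in the decision rule below:

$$ g(p) = \log\frac{p}{1 - p}, \qquad h_\theta(x) = \sigma(\theta^\top x) = P(y = 1 \mid x) $$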

Most of the time the work stops here. Sure, we might look at the accuracy [8], precision [9], or recall [10], but generally speaking, if our model performs reasonably well on these metrics we’ll call it a day and move on to the next project. The issues above are why, I believe, looking at this problem through a slightly Bayesian lens might be helpful. Because thresholding the model’s output squashes everything down to a hard 0 or 1, we don’t get an idea of the internal uncertainty of the model itself. Under the hood, our model might be learning only weak associations, or the data itself might not be linearly separable. For example, the decision boundary for logistic regression is typically:

$$ \hat{y} = \begin{cases} 1 & \text{if } h_\theta(x) \geq 0.5 \\ 0 & \text{if } h_\theta(x) < 0.5 \end{cases} $$

Our model might only be generating very “soft” decision boundaries, constantly predicting probabilities like 0.55 and 0.45 that just barely clear the threshold.
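One quick way to see how soft those boundaries actually are is to look at predict_proba rather than predict. This is a toy sketch on made-up one-dimensional data (not the ChatGPT dataset), just to illustrate the point:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two heavily overlapping 1-D clusters, so the classes are
# barely separable.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-0.2, 1.0, 200), rng.normal(0.2, 1.0, 200)]).reshape(-1, 1)
y = np.concatenate([np.zeros(200), np.ones(200)])

clf = LogisticRegression().fit(X, y)

# predict() collapses everything to a hard 0/1, but predict_proba() shows
# how many predictions sit just barely on either side of the 0.5 threshold.
proba = clf.predict_proba(X)[:, 1]
print("share of predictions in the 'soft' 0.4-0.6 band:",
      ((proba > 0.4) & (proba < 0.6)).mean())

With classes that overlap this much, most of the predicted probabilities pile up near 0.5, even though predict happily hands every point a hard label.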

If we want to report this prediction back to someone, we could say “We are 55% sure your assignment was generated by ChatGPT.” But that, for me, isn’t enough. Sure, the model results in a prediction of 0.55, but it’d be nice to know the actual distribution of plausible predictions for that set of input features. A standard frequentist point estimate doesn’t give us that, but it is exactly what Bayesian-ism is all about, and something I will return to in a later update to this series.
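For a taste of what that later update might involve, below is a minimal sketch of a Bayesian logistic regression in PyMC. Everything here is placeholder: the data is random, the priors are arbitrary choices of mine, and none of it has been run against the ChatGPT dataset. The point is only that sampling gives a full posterior distribution over each predicted probability, rather than a single number like 0.55.

import numpy as np
import pymc as pm

# Placeholder data only: in the later update this would be the TF-IDF features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

with pm.Model():
    intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=X.shape[1])

    # Same sigmoid as before, but every weight now carries a posterior
    # distribution instead of a single point estimate.
    p = pm.Deterministic("p", pm.math.sigmoid(intercept + pm.math.dot(X, beta)))
    pm.Bernoulli("obs", p=p, observed=y)

    idata = pm.sample(1000, tune=1000, progressbar=False)

# Each observation now has a whole distribution of predicted probabilities,
# not just one number.
print(idata.posterior["p"].mean(dim=("chain", "draw")).values[:5])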

Just Show Some Code

Finally, I’ll get to the meat of this.

The data comes from this Kaggle dataset [11], which includes 7344 sentences: 4008 generated by ChatGPT and 3336 written by humans. I’m not going to go too deep into the theory of why I’m lemmatizing [12] the text or how the scikit-learn [13] API functions, but there’s plenty of further reading on general NLP topics [14].

More of the supporting code can be found here [15], but to breeze through this quickly, the excerpt below shows the naive text cleaning:

import string

import nltk
from nltk.stem import WordNetLemmatizer


def clean_text(s: str) -> str:
    """Lowercase the text and strip punctuation.

    :param s: (str)
    :return: (str)
    """
    return s.lower().translate(str.maketrans("", "", string.punctuation))


# df is the Kaggle dataframe loaded in the supporting code [15].
df["cleaned_sentence"] = df["sentence"].apply(clean_text)

nltk.download("wordnet")
lemmer = WordNetLemmatizer()


def lemmatize_text(s: str, lemmer: WordNetLemmatizer) -> str:
    """Lemmatize each whitespace-separated token in the text.

    :param s: (str)
    :param lemmer: (WordNetLemmatizer)
    :return: (str)
    """
    return " ".join([lemmer.lemmatize(word) for word in s.split()])


df["lemmatized_text"] = df["cleaned_sentence"].apply(lambda x: lemmatize_text(x, lemmer))

From here, we can directly load this data into a TF-IDF [16] vectorizer and start fitting and testing our simple logistic regression:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # any fixed seed works; this is just for reproducibility

tfidf = TfidfVectorizer()

# Hold out 20% of the sentences for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    df["lemmatized_text"],
    df["class"],
    test_size=0.2,
    random_state=RANDOM_SEED,
)

# Fit the vectorizer on the training split only, then reuse it on the test split.
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))

This simple setup takes almost no time to run through and produces the following results for me:

Accuracy: 0.7957794417971409
Precision: 0.7883472057074911
Recall: 0.8445859872611465
F1-Score: 0.8154981549815498

Which is really not bad for not fiddling with the logistic regression parameters [17] or doing any hyper-parameter optimization [18].
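For what it’s worth, if I did want to fiddle, the most obvious knob is the regularization strength C. A minimal sketch of a grid search might look like the below; the grid values are arbitrary picks of mine, and the results above weren’t produced with it:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune the inverse regularization strength C over a small, arbitrary grid,
# scoring each candidate by cross-validated F1 on the training split.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
grid.fit(X_train_tfidf, y_train)

print("best C:", grid.best_params_["C"])
print("cross-validated F1:", grid.best_score_)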