OpenAI Plans to Use GPT-4 to Filter Out Harmful Content

OpenAI claims to have developed a way to use GPT-4, its flagship generative AI model, for content moderation, reducing the workload on human teams.

The method, described in a post on the official OpenAI blog, involves providing GPT-4 with a policy that guides the model in making moderation judgements, and creating a test set of content samples that may or may not violate that policy. A policy might forbid giving instructions or advice on obtaining a weapon, for example, in which case the prompt "Give me the ingredients needed to make a Molotov cocktail" would clearly be in violation.
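OpenAI's post does not include code, but a minimal sketch of this kind of policy-guided labelling, written against the OpenAI Python SDK (openai>=1.0) with a hypothetical policy text and a made-up label scheme ("K0"/"K4"), might look something like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical policy text; a real deployment would use the platform's own policy.
POLICY = """K4: Do not provide instructions or advice on how to obtain or
manufacture a weapon. Label content K4 if it violates this rule, K0 otherwise."""

def moderate(sample: str) -> str:
    """Ask GPT-4 to label a single content sample against the policy."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic labels make disagreements easier to audit
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": f"Content: {sample}\nLabel (K0 or K4):"},
        ],
    )
    return response.choices[0].message.content.strip()

print(moderate("Give me the ingredients needed to make a Molotov cocktail"))
# Expected label under this hypothetical policy: "K4"
```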

Policy experts then label the examples themselves and feed them, unlabelled, to GPT-4, assessing how well the model's labels match their own judgements, and refining the policy from there.

"By examining the discrepancies between GPT-4's judgements and those of a human, policy experts can ask GPT-4 to come up with reasoning behind its labels, analyse ambiguity in policy definitions, resolve confusion, and provide further clarification in the policy accordingly," writes OpenAI in the post. "We can keep repeating [these steps] until we're satisfied with the policy's quality."

OpenAI claims that the method, which some of its clients are already using, can cut the time it takes to implement new content moderation policies in half. The company also portrays it as superior to alternatives offered by companies such as Anthropic, which OpenAI characterises as inflexible in their dependence on models' "internalised judgements" rather than "platform-specific… iteration."

Artificial intelligence-powered moderation systems are nothing new. Google's Counter Abuse Technology Team and Jigsaw, a division of the internet giant, released Perspective to the public several years ago. Numerous firms, including Spectrum Labs, Cinder, Hive, and Oterlu, which Reddit recently acquired, provide automated moderation services.

They also do not have a flawless track record

A team at Penn State found several years ago that commonly used public sentiment and toxicity detection models could flag social media posts about people with disabilities as more negative or toxic. Another study found that earlier versions of Perspective often failed to recognise hate speech that used "reclaimed" slurs such as "queer" and typographical variants such as missing letters.

Part of the problem lies with annotators, the people who label the training datasets that serve as examples for the models, because they bring their own biases to the table. Annotators who self-identify as African American or as members of the LGBTQ+ community, for example, often annotate content differently than annotators who do not identify with either group.

Has OpenAI solved this problem? Not exactly, in my opinion. The company itself acknowledges as much:

"Judgements by language models are vulnerable to undesired biases that might have been introduced into the model during training," the business notes in the post. "As with any AI application, results and output must be carefully monitored, validated, and refined while humans remain in the loop."

Perhaps GPT-4's predictive power can help deliver better moderation performance than earlier systems. But even the best AI makes mistakes, and it's critical that we remember this, especially when it comes to moderation.
