Trained on the dark web, DarkBERT AI can help combat cybercrime

The AI is intended to help cybersecurity experts gather cyber threat intelligence

In an unprecedented step, a group of South Korean academics created DarkBERT, an LLM trained only on dark web data. Their aim was an artificial intelligence tool that outperforms existing language models and helps threat researchers, law enforcement, and cybersecurity experts combat cyber threats.

What is DarkBERT?

DarkBERT is a transformer-based encoder model built on the RoBERTa architecture. The LLM was trained on millions of dark web pages, including data from hacker forums, scam websites, and other criminal internet sources. The term "dark web" refers to a concealed part of the internet that cannot be reached with standard web browsers. It is known for its anonymous websites and markets, which are notorious for criminal activities such as trafficking in stolen data, narcotics, and firearms.

The researchers used the Tor network to access the dark web and collect raw data for training DarkBERT. They carefully filtered this data using techniques such as deduplication, category balancing, and pre-processing to produce a refined dark web corpus. RoBERTa was then trained on this corpus for around 15 days to produce DarkBERT.
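
The article does not say how the researchers implemented these filtering steps, so the following is only a minimal sketch of what deduplication and category balancing can look like. It assumes each crawled page is a dictionary with hypothetical "category" and "text" fields, treats pages as duplicates when their whitespace-normalized, lowercased text is identical, and balances by randomly downsampling large categories. The sample pages are invented for illustration.

```python
import hashlib
import random
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first page seen for each distinct normalized text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def balance_categories(pages, max_per_category, seed=0):
    """Downsample every category to at most max_per_category pages."""
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > max_per_category:
            group = rng.sample(group, max_per_category)
        balanced.extend(group)
    return balanced


# Hypothetical input; the field names and texts are assumptions.
pages = [
    {"category": "forum", "text": "Selling  fresh logs"},
    {"category": "forum", "text": "selling fresh logs"},
    {"category": "forum", "text": "New exploit thread"},
    {"category": "forum", "text": "Looking for a partner"},
    {"category": "market", "text": "Item listing"},
]

unique_pages = deduplicate(pages)
corpus = balance_categories(unique_pages, max_per_category=2)
print(len(pages), len(unique_pages), len(corpus))
```

Exact-match hashing only catches copies that differ in case or spacing. A real crawl would likely also need near-duplicate detection, which this sketch leaves out.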

DarkBERT's Potential Use in Cybersecurity: DarkBERT has a strong grasp of the language cybercriminals use and excels at identifying specific potential threats. It can be used to research the dark web and to discover and flag cybersecurity threats such as data breaches and ransomware, making it a potentially valuable weapon in the battle against cyber threats.

According to the research published on arxiv.org, the researchers compared DarkBERT with two well-known NLP models, BERT and RoBERTa, analyzing their performance across three cybersecurity-related use cases:

Check Dark Web Forums for Potentially Hazardous Topics: Monitoring dark web forums, which are widely used to exchange unlawful information, is critical for discovering potentially harmful posts. However, examining them manually is time-consuming, so security specialists benefit from automating the process.
Locate Websites That Store Sensitive Information: Hackers and ransomware groups use the dark web to set up leak sites that publish confidential information stolen from firms that refuse to pay ransom demands. Other criminals simply post leaked sensitive material, such as passwords and bank information, to the dark web with the intention of selling it.
Detect Threat-Related Keywords on the Dark Web: DarkBERT uses the fill-mask function, a feature of BERT-family language models, to detect words associated with criminal activities such as drug sales on the dark web. When the word "MDMA" was masked on a drug sales page, DarkBERT suggested drug-related words, whereas other models suggested generic words and keywords unrelated to drugs, such as various professions. DarkBERT's ability to identify terms associated with illegal activity might help in identifying and addressing new cyber risks.
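
The fill-mask step from the third use case can be sketched as follows. This is an illustration, not the researchers' code: mask_keyword and top_candidates are hypothetical helper names, the sample sentence is invented, and stub_fill_mask is a stand-in that returns made-up placeholder tokens so the example runs without downloading a model. The stub mimics the list of dictionaries with "token_str" and "score" keys that a Hugging Face transformers fill-mask pipeline returns for an input with a single mask, and "<mask>" is the mask token used by RoBERTa-style models.

```python
def mask_keyword(text, keyword, mask_token="<mask>"):
    """Replace the first occurrence of keyword with the model's mask token."""
    if keyword not in text:
        raise ValueError(f"{keyword!r} does not occur in the text")
    return text.replace(keyword, mask_token, 1)


def top_candidates(fill_mask, masked_text, k=5):
    """Return the k highest-scoring replacement words for the masked slot."""
    predictions = fill_mask(masked_text)
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]


def stub_fill_mask(masked_text):
    """Hypothetical stand-in for a real model; tokens and scores are placeholders."""
    return [
        {"token_str": " alpha", "score": 0.2},
        {"token_str": " beta", "score": 0.5},
        {"token_str": " gamma", "score": 0.3},
    ]


# Invented example sentence, not taken from the paper.
masked = mask_keyword("25 x MDMA, discreet worldwide shipping", "MDMA")
candidates = top_candidates(stub_fill_mask, masked, k=2)
print(masked)
print(candidates)
```

To compare real models, one would pass a callable created with transformers' pipeline("fill-mask", model=...) in place of the stub and inspect which words each model proposes for the masked slot. Which checkpoint to load is left open here.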

Use of AI for Threat Detection and Prevention: DarkBERT was pre-trained on dark web data and outperformed existing language models across several cybersecurity use cases, which makes it a promising tool for furthering dark web research. The dark web-trained AI might be used for various cybersecurity tasks, such as identifying websites selling leaked personal data, monitoring dark web forums for illicit information exchange, and finding keywords relevant to cyber threats. However, DarkBERT, like other LLMs, is a work in progress, and its performance may improve with continued training and fine-tuning.
