
Spam emails have traditionally been seen as merely annoying, unsolicited messages containing advertisements, but they increasingly include scams, malware, or phishing. To ensure the security and integrity of users, organisations and researchers aim to develop robust filters for spam email detection. Recently, most machine-learning-based spam filters published in academic journals report very high performance, yet users still report a rising number of frauds and attacks delivered via spam emails.

A popular text representation method was applied to the proposed crawler-DB and combined with two different supervised classifiers to facilitate the categorization of the Tor concealed services. The experiments conducted in this study show that the Term Frequency-Inverse Document Frequency (TF-IDF) word representation with a linear support vector classifier (linear SVC) achieves 91% accuracy under 5-fold cross-validation when classifying a subset of illegal activities from crawler-DB, while Naïve Bayes reaches 80.6%. The good performance of the linear SVC might support potential tools to help the authorities detect these activities. Moreover, the outcomes are expected to be significant in both practical and theoretical terms, and they may pave the way for further research. The algorithm built in this study demonstrated good performance, achieving an accuracy of 85%.
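To make the TF-IDF representation mentioned above more concrete, the following is a minimal sketch in plain Python. It assumes one common textbook variant (raw term count multiplied by log(N/df)); the exact weighting and tokenization used in the study are not stated, and the function name `tfidf` and the toy corpus are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: TF is the raw term count and IDF is log(N / df).
    This is an illustrative choice, not necessarily the study's.
    """
    # Naive whitespace tokenization, sufficient for a sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus, not taken from crawler-DB.
docs = [
    "buy cheap pills online",
    "cheap hosting online",
    "forum about privacy",
]
weights = tfidf(docs)
```

Terms that occur in fewer documents (such as "buy" here) receive larger weights than terms shared across documents (such as "cheap"). In a setup like the one described, such weighted vectors would then be passed to a supervised classifier, for example a linear SVC or Naïve Bayes, and evaluated with 5-fold cross-validation.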

The freedom offered by the Deep Web gives people the opportunity to express themselves freely and discreetly; sadly, this is also one of the reasons why people carry out illicit activities there. In this work, a novel dataset of active Dark Web domains, known as crawler-DB, is presented. To build crawler-DB, the Onion Routing (Tor) network was sampled, and a web crawler capable of following links was built. The link addresses gathered by the crawler are then automatically classified into five classes.
