Wed 10 Jan 2007
The AIRWeb workshop is now in its third edition. It covers topics related to Adversarial Information Retrieval on the Web, that is, how to search, rank, or classify documents when a fraction of them have been manipulated with malicious intent. This includes search engine spam as well as comment spam, splogs, click fraud, and several other themes.
The dominant topic in past years has been search engine spam, an obnoxious problem that affects all major search engines, either by tricking them into showing irrelevant results for some queries, or simply by wasting part of their network and storage resources. This year, the AIRWeb workshop will include a novel element: a reference collection of Web pages, in which over 3,000 hosts have been labeled by a team of volunteers as spam or non-spam.
The following is a partial view of the corpus (black nodes are spam, white nodes are non-spam):
The organizers of the Web Spam Challenge provide the graph, the training labels, the contents of the pages, and a set of pre-computed feature vectors. The goal is to predict the label (spam or non-spam) for a test set of hosts whose labels are not given. For more information, check the challenge web site.
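To make the task concrete, here is a minimal sketch of a naive baseline: each test host takes the majority label of its labeled neighbors in the graph, falling back to the most common training label when no neighbor is labeled. This is only an illustration, not the challenge's method or data format. The function name `predict_labels`, the edge-list representation, the label strings, and the toy host names are all assumptions made for this example.

```python
from collections import Counter

def predict_labels(edges, train_labels, test_hosts):
    """Hypothetical baseline: majority vote over labeled graph neighbors."""
    # Build an undirected adjacency map from the (source, target) edge list.
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    # Fallback label: the most common label in the training set.
    default = Counter(train_labels.values()).most_common(1)[0][0]

    predictions = {}
    for host in test_hosts:
        votes = Counter(
            train_labels[n] for n in neighbors.get(host, ()) if n in train_labels
        )
        if votes["spam"] > votes["nonspam"]:
            predictions[host] = "spam"
        elif votes["nonspam"] > votes["spam"]:
            predictions[host] = "nonspam"
        else:
            # No labeled neighbors, or a tie: use the fallback label.
            predictions[host] = default
    return predictions

# Toy data, invented for illustration only (not taken from the corpus).
edges = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("d.example", "e.example"),
    ("e.example", "f.example"),
]
train_labels = {
    "a.example": "spam",
    "c.example": "spam",
    "d.example": "nonspam",
    "g.example": "nonspam",
    "h.example": "nonspam",
}
test_hosts = ["b.example", "e.example", "z.example"]

print(predict_labels(edges, train_labels, test_hosts))
```

A real entry would more likely train a classifier on the pre-computed feature vectors and use the graph as an additional signal; the neighbor vote is shown only because it fits in a few lines.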
See you in Banff!