(invitation to the AIRWeb’2007 workshop, authored by ChaTo (Carlos Alberto-Alejandro CASTILLO-Ocaranza))

The AIRWeb workshop is now in its third edition. It covers topics in Adversarial Information Retrieval on the Web, that is, how to search, rank, or classify documents when a fraction of them has been manipulated with malicious intent. This includes search engine spam as well as comment spam, splogs, click fraud, and several other themes.

The dominant topic in past years has been search engine spam, an obnoxious problem that affects all major search engines, either by tricking them into showing irrelevant results for some queries, or simply by wasting part of their network and storage resources. This year, the AIRWeb workshop includes a novel element: a reference collection of Web pages in which over 3,000 hosts have been labeled as spam or non-spam by a team of volunteers.

[Figure: a partial view of the corpus; black nodes are spam, white nodes are non-spam.]

The organizers of the Web Spam Challenge provide the graph, training labels, the contents of the pages, and a set of pre-computed feature vectors. The goal is to predict the label (non-spam or spam) for a test set of hosts whose labels are not given. For more information, check the challenge web site.
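The shape of the task is ordinary supervised classification: learn from labeled (feature vector, label) pairs and predict labels for unlabeled hosts. Below is a minimal sketch of that workflow using a nearest-centroid classifier on made-up two-dimensional features; the host names, feature values, and the choice of classifier are illustrative assumptions, not part of the challenge itself, which supplies its own pre-computed feature vectors per host.

```python
# Sketch of the challenge workflow: train on labeled hosts, predict
# labels for test hosts. All data here is toy data for illustration.
from math import dist  # Euclidean distance (Python 3.8+)

# toy training set: (feature_vector, label) pairs
train = [
    ((0.9, 0.8), "spam"),
    ((0.8, 0.9), "spam"),
    ((0.1, 0.2), "nonspam"),
    ((0.2, 0.1), "nonspam"),
]

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

# one centroid per class
centroids = {
    label: centroid([v for v, lbl in train if lbl == label])
    for label in {lbl for _, lbl in train}
}

def predict(features):
    """Assign the label of the nearest class centroid."""
    return min(centroids, key=lambda lbl: dist(features, centroids[lbl]))

# hypothetical unlabeled test hosts
test_hosts = {"host-a.example": (0.85, 0.75), "host-b.example": (0.15, 0.25)}
predictions = {host: predict(f) for host, f in test_hosts.items()}
# predictions: {"host-a.example": "spam", "host-b.example": "nonspam"}
```

Challenge submissions would of course use far richer features (content and link-based) and stronger classifiers, but the train-then-predict loop is the same.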

See you in Banff!

(invitation to the Query Log Analysis workshop, authored by Einat Amitay, IBM Research, Haifa, Israel)

The dilemma of whether or not to use the AOL query log data for research is described in detail in a NYTimes article: “Researchers Yearn to Use AOL Logs, but They Hesitate”. Search engine companies no longer support independent academic research and have stopped sharing their data with graduate students and university professors. The hesitation and the data embargo are stopping research from being conducted, which in turn widens the gap between what is known to the public via published research and what is hidden behind corporate legalese.

We initiated this workshop thinking that WWW 2007 is the right place to raise this issue and find a solution that will allow researchers to use query log data without fear of being accused of a crime.

There are many ways in which we can help amend the situation. We can establish a research collection of query logs donated by consenting individual users. We can create a standard for accepting or rejecting log recording similar to the robots.txt solution. We can promote research for anonymization of logs. And we can help persuade the public that our intentions are good and that search engines live and die by their data.

We hope to have all sides represented in our workshop. Please come and join us!