January 2007

(authored by Peter F. Patel-Schneider, Program Co-Chair)

Now that the dust has mostly settled and my nerves have calmed down, I decided that I would write a short note about the WWW2007 paper submission process.

When running any conference there is always the worry that the number of papers will not be what was expected.  There is the possibility of a disaster – perhaps the research area is imploding, perhaps conference publicity went astray, perhaps something will go wrong with the submission process, etc. – and too few papers are submitted.  (With all the problems with spam email, one worry is that conference announcements will be caught by over-zealous spam filters.)
There is also the possibility of a success-disaster, with so many papers submitted that the reviewing machinery and program committee are overloaded.

To add to these general worries, WWW has a history of problems.  Last year the building housing the computers for the conference web site experienced a major fire just before submissions were due.  In previous years, the submission site had serious capacity problems.

For WWW2007 there was also a new submission process – the EasyChair system.  In addition, the reviewing process for WWW2007 has essentially no slack in it, so slipping the submission deadline, as has become quite common, was not an option.
With all these issues, I was rather nervous about the number of submissions for WWW2007.  To try to calm my nerves I planned on counting the number of submissions at various points.  In my previous experience with running conferences (admittedly a long time ago) the rule of thumb was that 1/3 of the submissions arrived the last day, 1/3 the day before, and 1/3 before that, so I had some expectations on how the numbers would go.

Unfortunately, the early “returns” were very low.  One week before the deadline there were only 55 submissions.  Two days before the deadline there were only 132 submissions.  By my rule of thumb this would mean about 400 total submissions – a rather large drop from the 716 submissions in 2006.  One day before the deadline there were only 252 submissions, indicating only about 380 total submissions.  I was now definitely beginning to worry.  Although the pace of submissions picked up during the last day, by the time I went to sleep about six hours before the deadline, there were only 522 submissions, and I was still quite nervous.
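
The projections above follow directly from the rule of thumb, and the arithmetic can be checked with a short sketch.  The function name `project_total` and the variable names are purely illustrative; only the counts (132, 252, 522, 775) come from this note.

```python
from fractions import Fraction

# Rule of thumb from the text: 1/3 of submissions arrive before the last
# two days, 1/3 on the day before the deadline, and 1/3 on the last day.
# So 1/3 should be in hand two days out, and 2/3 one day out.
FRACTION_RECEIVED = {2: Fraction(1, 3), 1: Fraction(2, 3)}


def project_total(count_so_far, days_before_deadline):
    """Project the final total from a count taken 1 or 2 days before the deadline."""
    return int(count_so_far / FRACTION_RECEIVED[days_before_deadline])


two_days_out = project_total(132, 2)  # 132 is assumed to be 1/3 of the total
one_day_out = project_total(252, 1)   # 252 is assumed to be 2/3 of the total
late_surge = 775 - 522                # submissions in the last six hours

print(two_days_out, one_day_out, late_surge)  # prints: 396 378 253
```

The projected totals of 396 and 378 are the "about 400" and "about 380" quoted above; both fall far short of the actual 775, which is why the rule of thumb proved so misleading here.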

Of course, all my worries turned out to be unfounded.  A very late surge of submissions (253 submissions in the last six hours) resulted in 775 submissions to WWW2007, more than in any previous year, but not more than had been allowed for. 

In retrospect, I should have expected this late surge, as electronic submission allows for last-minute behaviour and researchers are notorious for not being early.  However, I instead expected that the history of problems with WWW would have made authors more conservative.  There were a couple of tracks that had to add a few extra PC members, but surprisingly little had to be done to react to the submissions.

Now if only the reviewing process works as well….

(invitation to the AIRWeb’2007 workshop, authored by ChaTo (Carlos Alberto-Alejandro CASTILLO-Ocaranza))

The AIRWeb workshop is now in its third edition. This workshop covers several topics related to Adversarial Information Retrieval on the Web, that is, how to search, rank, or classify documents when a fraction of the documents has been manipulated with malicious intent. This includes search engine spam as well as comment spam, splogs, click fraud, and several other themes.

The dominant topic in past years has been search engine spam, an obnoxious problem that affects all major search engines, either by tricking them into showing irrelevant results for some queries, or simply by wasting a part of their network and storage resources. This year, the AIRWeb workshop will include a novel element: a reference collection of Web pages, in which over 3,000 hosts have been labeled by a team of volunteers as spam or non-spam.

[Figure: a partial view of the corpus; black nodes are spam, white nodes are non-spam.]


The organizers of the Web Spam Challenge provide the graph, training labels, the contents for the pages, and a set of pre-computed feature vectors. The goal is to predict the label (non-spam or spam) for a test set of hosts for which labels are not given. For more information, check the challenge web site.

See you in Banff!

(invitation to the Query Log Analysis workshop, authored by Einat Amitay, IBM Research, Haifa, Israel)

The dilemma of whether or not to use the AOL query log data for research is described in detail in a NYTimes article: “Researchers Yearn to Use AOL Logs, but They Hesitate”. Search engine companies no longer support independent academic research and have stopped sharing their data with graduate students and university professors. The hesitation and the data embargo are stopping research from being conducted, which in turn widens the gap between what is known to the public via published research and what is hidden behind corporate legalese.

We initiated this workshop thinking that WWW 2007 is the right place to open this issue and find a solution that will allow researchers to use query log data without the fear of being accused of a crime.

There are many ways in which we can help remedy the situation. We can establish a research collection of query logs donated by consenting individual users. We can create a standard for accepting or rejecting log recording, similar to the robots.txt solution. We can promote research on the anonymization of logs. And we can help persuade the public that our intentions are good and that search engines live and die by their data.

We hope to have all sides represented in our workshop. Please come and join us!