Last updated: 27 April 2018 - M.T. Carrasco Benitez

18th International World Wide Web Conference, 20-24 April 2009, Madrid


On the 20th Anniversary of the Web, Welcome to the 18th International World Wide Web Conference.

The International World Wide Web Conferences Steering Committee (IW3C2), Universidad Politécnica de Madrid (UPM) and Madrid municipality cordially invite you to participate in the 18th International World Wide Web Conference (WWW2009) to be held in Madrid, the charming and cosmopolitan Spain's capital.

The World Wide Web Conference is the global event that brings together key researchers, innovators, decision-makers, technologists, businesses, and standards bodies working to shape the Web.

Organized by IW3C2 since 1994, the WWW conference is the annual opportunity for the International community to discuss and debate the evolution of the Web. The conference will feature a range of presentations on world-class research, as well as stimulating talks, workshops, tutorials, panels, and late-breaking posters.

WWW2009 seeks original papers describing research in all areas of the Web covering the implications of ubiquitous access to the Web through the "three screens" - computer, phone, and TV - and how such Web access will change the way we live, work, and interact in the future. Accepted refereed papers will appear in the conference proceedings published by the Association for Computing Machinery (ACM).

Topics of discussion will include Browsers and User Interfaces, Data Mining, Industrial Practice and Experience, Internet Monetization, Mobility, Performance and Scalability, Rich Media, Search, Security and Privacy, Semantic / Data Web, Social Networks and Web 2.0, Web Engineering, XML and Web Data. The conference will also feature plenary speeches by renowned speakers, and tracks devoted to developers and to recent W3C activities that are of interest to the community.

Program, register and full information


The photo tag for the Conference is WWW2009Madrid. Each event also has a photo tag, indicated in the appropriate section. For example, the photo tag for the Semantic Search Workshop is SemSearch09, which is automatically combined with the Conference photo tag. To request a new photo tag contact the editor.

Special Events

Welcome Reception hosted by the Fundación ONCE


By invitation only.


DinnerTalk Professor Dame Wendy Hall

Abstract. Every conference has its own culture, its own character and its own community. As someone who came to the Web via the hypertext community, I couldn't understand why people who attended hypertext conferences didn't want to come to Web conferences, and I couldn't understand why people who attended Web conferences weren't interested in hypertext. I had the same feelings of frustration when the Semantic Web community decided to launch its own conference series. To me this was the future of the Web and key developments needed to be discussed at the Web conference. But we now have a very healthy Semantic Web conference community, who meet on an annual basis to discuss concepts that may be of vital importance to the future of the Web but many of them will never have been to a Web conference.

In this talk I will explore the history of the Web conference from the perspective of colliding cultures. What dynamics created the conference, how has it evolved along the way and where might it go next? I'm one of a group of people who have just successfully organised the first Web Science conference in Athens. Why did we do this? Why didn't we organise it as part of the Web conference? Any of you who have organised a major international conference know what hard work it is. Why do we make life so difficult for ourselves by constantly spinning off new conferences? Why indeed in this digital age do we still feel the need to physically attend conferences?

Closing Ceremony


Press cutting


Reflecting on the last 20 years and looking forward to the next 20

Keynote-1 Sir Tim Berners-Lee, Director of the World Wide Web Consortium


The Continuing Metamorphosis of the Web

Keynote-2 Dr. Alfred Z. Spector, VP Research and Special Initiatives, Google, Inc.

Abstract. The invention of HTML and HTTP catalyzed a path of enormous innovation that was hard to foresee in the early 1990's. The Web's continuing metamorphosis has led to fantastically increased capabilities and economic value. It has catalyzed the creation of distributed systems orders of magnitude larger than any previously built, new programming and distribution models for computer applications, great advances in the fields of information retrieval, entirely new domains for theoretical computer science, and more. This greatly enhanced web is changing the entire environment and enabling some early research promises to become a reality for most Internet users. In this presentation, I will discuss such examples, and in particular, what happens when speech, image processing, human language translation, and mobility are woven into all we do. I will also extrapolate from some current research and advanced web technologies to paint a picture of the web five-to-ten years out. This should have implications for the computer science community, as well as the vast community that is leveraging the web for ever greater goals.

Web infrastructure for the 21st Century

Keynote-3 Dr. Pablo Rodriguez, Internet Scientific Director, Telefonica R&D

Abstract. The Web success in leading the information technology revolution has relied on an enormous computing infrastructure. In recent years, both cloud computing as well as social networks are putting even more burden in such infrastructure. The cloud computing paradigm is creating a massive shift in computing Ð from PC- based applications to cloud-based applications. Cloud computing frees users from having to remember where the data resides, gives users access to information anywhere, and provides fast services through essentially infinite online computing. Social networks, on the other hand, emerge the social aspects of the Web where the social interactions put demands on Web applications, and in turn further demands in the Web's infrastructure. The Internet, which was mostly developed for interactive applications between humans and computers, has struggled to handle the necessities of a Web designed around content. For instance, as Web content moves from one place to another, Web pointers are broken and so do search ranking algorithms. Similarly, content is often not where it should be when you need it and routers waste capacity copying the same content millions of times. As a result, we have seen the emergence of Internet systems that were not planned for from the beginning, e.g. Web Caching, Content Distribution Networks, or P2P networks. In this talk I will discuss the challenges that the Web is posing in today's Internet infrastructure, and argue about various solutions to cope with them. In particular, I will argue how to re-think the Internet to do networking at the content/information layer, and the underlying architectural system design principles needed to guide the efficient engineering of new Web infrastructure services.

Mining the Web 2.0 for Better Search

Keynote-4 Dr. Ricardo Baeza-Yates, Yahoo! Research, Barcelona, Spain

Abstract. There are several semantic sources that can be found in the Web that are either explicit, e.g. Wikipedia, or implicit, e.g. derived from Web usage data. Most of them are related to user generated content (UGC) or what is called today the Web 2.0. In this talk we show several applications of mining the wisdom of crowds behind UGC to improve search. We will show live demos to find relations in the Wikipedia or to improve image search as well as our current research in the topic. Our final goal is to produce a virtuous data feedback circuit to leverage the Web itself.


Web 20th Anniversary panel

Chair: Wendy Hall

Abstract. As WWW2009 coincides with the 20th Anniversary of the Web conference organisers would like to include this panel to commemorate it. The initial idea is to provide a rich overview of the past 20 years and as well as forecast into the future.

Living Web: Making its Diversity a true asset

Chair: Claudia Niederee, L3S Research Center, Hannover

Abstract. The large diversity (with respect to cultural background, school of thought, intention, opinion etc. ) underlying the content creation process of the Web contributes a lot to the wealth and value of the Web. On the other hand, the wide variety of opinions and biased content found on the Web makes it difficult to get a balanced view on a given subject, given that many people only consider the ten search results provided as the first page of results by a typical search engine. This panel will controversially discuss the risk of biasing through the Web as well as the potentials that are inherent in diversity and that today are only marginally exploited.

The role of the Telco in the new Web: is possible to create a win-win relationship?

Chair: Luis Angel Galindo

Abstract. Main issues to cover are (but not only):

Can operators offer services more than the connectivity? Net neutrality and the separation of network and services New paradigms for competition: iPhone, flat rate in mobiles, open source, open assets to third parties

Is possible a new ecosystem (as I pursuit in the WIMS 2.0 initiative) more fair for all the actors? User 2.0: user demands customized services. Are big players ready to provide this services?

Is increasing the number of technololgy disable people?

Are there ways in which operators need to behave more like wireline ISPs, and ways in which wireline needs to concentrate on the role of clearinghouse rather than aggregator?

The Emergence of Web Science

Chair: Nigel Shadbolt

Abstract. Since the term was coined in 2005, Web Science has provided a rallying call for researchers who are interested in the social and organisational behaviour engendered by the Web as about the underpinning technology. Web Science is inherently inter-disciplinary. Web Science research aims to provide the means to better model the Web’s structure, describe the principles that have fuelled its phenomenal growth, and discover how online human interactions are driven by and can change social conventions. Research is needed to reveal the principles that will ensure that the network continues to grow productively. Research is required to implement principles that can settle complex issues such as privacy protection and intellectual property rights. Of course, we cannot predict what this nascent discipline might reveal. But Web science has already generated powerful insights, how the Web is structured, how resilient it is, how ideas travel through the tens of millions of blogs, how we might include information in Web content so that its accuracy and origin is more transparent. At the micro scale, the Web is an infrastructure of artificial languages and protocols; it is a piece of engineering. However, it is fundamentally about the interaction of human beings creating, linking and consuming information. It is this interaction that we also need to research and understand. It is this interaction that generates the Web's behavior as emergent properties at the macro scale. These macro properties are often surprising and require analytic methods to understand them. The Web’s use is part of a wider system of human interaction – the Web has had profound effects on society, with each emerging wave creating both new challenges and new opportunities available to wider sectors of the population than ever before.

The Web Science Research Initiative (WSRI) was launched in November 2006 by Tim Berners-Lee, Wendy Hall, Nigel Shadbolt and Daniel Weitzner. At WWW2007 in Banff, WSRI sponsored a reception to present the ideas behind Web Science to the WWW community. At WWW2008, WSRI sponsored a workshop entitled “Understanding Web Evolution: A Prerequisite for Web Science” chaired by Dave De Roure. It attracted a lot of excellent papers – see - and was one of the largest workshops at the conference. In March 2009, WSRI is running its first Web Science conference in Athens, WebSci’09 – see - with the aim of bringing computer scientists and social scientists together to discuss this important topic.

The aim of this panel is to bring this debate to the heart of the WWW community at WWW2009 in Madrid.

Web Search APIs: The Next Generation

Chair: Soumen Chakrabarti

Abstract. Web search APIs are provided by a handful of large Web search companies and used by a large number of developers. Yet, downstream applications today have to depend not only on an adversarial corpus with uncertain boundaries, but also on a best-effort, black-box search system without any formal guarantees. The provocative proposition here is that unless search APIs evolve to provide more expressive semistructured query languages involving entity and relation types, and more precise control over matching, ranking, sampling and aggregation, search consumers will remain restricted to relatively simplistic downstream processing, such as mashups that join search results with other services like maps and yellow pages. In this panel we will urge the two sides to the search contract to discuss what features they want in the next generation of search APIs, whether the technology exists to provide these features, and whether the providers should move to provide these features.

Cancelled: Web-based Multimedia Search and Retrieval

Chair: Pere Obrador and Xavier Anguera

Abstract. Multimedia search and retrieval research has gone a long way from its beginnings. The research community has proposed several very good algorithms for search and retrieval of both images and video, which take advantage of content analysis, text associated with that content (i.e., user generated tags, web page text, etc.) and associated camera metadata. Also, the last few years have seen the appearance of web sites that allow for image search (e.g., video search (e.g. and audio search (e.g., and are being used by millions of users. In most cases, these commercial search algorithms use the text and tags associated with the media, but not the content analysis generated features, due to its disconnection from the end users' language terminology. Also the user feedback is rarely incorporated in the search process, reducing the personalization of the search results. Future commercial multimedia search and retrieval systems will have to bridge the semantic gap between low level features and the user' language so that these features can be incorporated in the search process, which should be customizable. In order to accomplish this, a standardized way of measuring user satisfaction will have to be implemented in the research community.

Invited talks

The Web as an Ideal Interface to Ensure Accessibility in Different ICT Environments

Invited-1 Jose Angel Martinez-Usero, Technosite, Fundación ONCE

Abstract. Nowadays, digital services tend to allow users to access information in a remote way and from different devices. Fields such as domotic or telecare applications are evolving to this model, for example, it is already possible to take control of our home from a computer in our office or from a mobile phone. These technological developments benefit many users, (i.e. senior citizen and disabled people), so they are able to overcome the barriers caused by their functional diversity. Moreover, there are guidelines that establish the basis for designing different technological applications. In the Internet field, W3C has developed the Web Content Accessibility Guidelines (WCAG 2.0) that are the basic reference in the production of national and international standards and laws. The combination of accessible and usable user interfaces with communication interoperability between the user personal device and the target device solves many of the barriers people find in their daily life.

Patents for Software ? Is your Invention Patentable ?

Invited-2 Wolfgang König, European Patent Office

Abstract. The core activity of the European Patent Office (EPO) is the examination of patent applications with a view to granting a European patent. Like all other inventions, computer-implemented inventions are patentable only if they have technical character (i.e. solve a technical problem), are new and involve an inventive technical contribution to the prior art. The EPO does not grant patents for computer programs ("software patents") or computer-implemented business methods that make no such technical contribution. The EPO is bound by European patent law. Our expert will guide you through the technicalities of patenting computer-implemented inventions.

Innovation in the Wild, Wild Web

Invited-3 Dale Dougherty, O'Reilly

Abstract. Behind many innovations in technology are the efforts of wild-eyed enthusiasts who set out to explore its possible uses. From the development of the first personal computers to the development of the Web, and its re-emergence as Web 2.0, small groups of enthusiasts have led the way. More often than not, they've been way ahead of commercial and academic organizations in discovering the potential for a new technology. So, when enthusiasts begin to explore and play with a new technology, we've learned at O'Reilly to take that as an early warning sign that the technology deserves serious consideration. O'Reilly has built a publishing company around following the work of such enthusiasts, whom we also call alpha geeks, and helping to spread their knowledge more widely. The emergence of the Maker movement is an example of innovation in the wild, just like the early Web, where a group of enthusiasts are tinkering with physical things and creating new, smart objects. They are using sensors, data loggers, and open source hardware to develop new ways of interacting with the world. If you want to know what technologies will help shape our future, find the enthusiasts who are already working with these technologies today.

The Emerging Mobile Widgets Ecosystem

Invited-4 Daniel Appelquist, Vodafone



2nd International Workshop on Linked Data on the Web (LDOW 2009)

Organising Committee:

Objectives. The Web is increasingly understood as a global information space consisting not just of linked documents, but also of linked data. More than just a vision, the Web of Data has been brought into being by the maturing of the Semantic Web technology stack, and by the publication of an increasing number of datasets according to the principles of Linked Data. Today, this emerging Web of Data includes data sets as extensive and diverse as DBpedia, Geonames, US Census, EuroStat, MusicBrainz, BBC Programmes, Flickr, DBLP, PubMed, UniProt, FOAF, SIOC, OpenCyc, UMBEL and Yago. The availability of these and many other data sets has paved the way for an increasing number of applications that build on Linked Data, support services designed to reduce the complexity of integrating heterogeneous data from distributed sources, as well as new business opportunities for start-up companies in this space.

Building on the success of last year's LDOW workshop at WWW2008 in Beijing, the LDOW2009 workshop aims to provide a forum for presenting the latest research on Linked Data and drive forward the research agenda in this area. While last year's workshop focused on the publication of Linked Data, this year's workshop will focus on Linked Data application architectures, linking algorithms and Web data fusion.

Workshop on Web Search Result Summarization and Presentation (WSSP 2009)

Workshop Chairs

Objectives. Providing a satisfying web search experience can be a challenging task for a search engine. Numerous disciplines -- search, summarization, user interface design, usability, metrics, machine learning and modeling -- all have to come together in order to deliver the final experience. Effective summarization is part of the challenge. In this workshop we will focus on various aspects of web summarization, presentation, and user satisfaction metrics and models. The kinds of questions and issues we would like to address are:

Topics of Interest. Main topics of interest in the areas of Web Results Summarization and Presentation include but are not limited to:

3rd International Workshop on Information Credibility on the Web (WICOW2009)

Workshop chairs:

Description. As computers and computer networks become more common, a huge amount of information, such as that found in Web documents, has been accumulated and circulated. Such information gives many people a framework for organizing their private and professional lives. However, in general, the quality control of Web content is insufficient due to low publishing barriers. In result there is a lot of mistaken or unreliable information on the Web that can have detrimental effects on users. This situation calls for technology that would facilitate judging the trustworthiness of content and the quality and accuracy of the information that users encounter on the Web. Such technology should be able to handle a wide range of tasks: extracting credible information related to a given topic, organizing this information, detecting its provenance, clarifying background, facts, and other related opinions and the distribution of them, and so on. The problem of Web information reliability and Web data quality has become also apparent in the view of the recent emergence of many popular Web 2.0 applications, the growth of the so-called Deep Web and the ubiquity of Internet advertising.

2nd Workshop on Mashups, Enterprise Mashups and Lightweight Composition on the Web (MEM 2009)


Description. Quick adoption of mashup and lightweight composition technologies is one of landmarks of contemporary World Wide Web evolution. The MEM 2009 workshop will adopt a multi-aspect perspective on these technologies to enable a complex understanding of their adoption, research and business challenges and prospects. Firstly, it will cover basic infrastructures and standards enabling scalable, secure and flexible deployment and enactment of mashup and lightweight composition technologies. Secondly, it will focus on user interfaces and interaction paradigms that make different approaches to user-driven content and functionalities composition available to non-expert users and collaborating crowds. Thirdly, it will take a look into remarkable cases of mashups that combine existing services and information sources with intelligent information processing algorithms and user collaboration to provide value-added services. Finally, the workshop will cover topics related to business aspects of mashup and lightweight composition technologies.

The major part of innovation related to mashups and lightweight composition is now happening in on line Web sites intended for wide audience from all over the world. However, these technologies have prospects of becoming major evolution factor for contemporary enterprises by bridging the gap between business and IT. Thus, the workshop will focus both on usage of mashup solutions in open Internet environments underlining the requirement for flexibility, performance and privacy preservation, and in business intranet environments stressing the need for security and control measures to be incorporated in mashup and composition solutions.

The goal of this workshop is to provide a platform for discussing research topics underlying the concepts of lightweight composition, mashups, and enterprise mashups. By bringing together representatives of academia and industry, the workshop is also an important venue for identifying new research problems and disseminating results of the research. By affiliating with a renowned international conference, the workshop provides a possibility to interact with researchers from other areas of the domain of Information Systems.

1st International Workshop on Motivation and Incentives on the Web (Webcentives'09)


Content The Web 2.0 movement has brought a new generation of usability and socio-technical change to the Web. At the same time, several so-called Web 2.0 applications had enormous success: Wikipedia,, Flickr, YouTube, Facebook, Twitter, Geni - to name just a few. Having differing objectives, they all have something in common: huge amounts of enthusiastic users contributing and creating a plethora of content. The high acceptance of these applications with Web users from all over the world prove that they are usable and - more importantly - provide some kind of benefit. Each of the applications has incentive structure well in place, triggering user interest and involvement.
The aim of the workshop is to address the following questions around incentives and motivation of Web applications: what is the motivation for a user to (install and) use a tool? Which incentive structures can be applied to the Web, which cannot? Moreover, incentives are a crucial topic for future Web generations: Web paradigms, like the Semantic Web or the 3D Web, that are novel and unfamiliar to end users, aim to involve wide user bases. WEBCENTIVES will attract contributions analyzing, applying, and designing incentive structures for Web applications. We want to emphasize that the workshop also aims at failures, i.e. cases where incentives failed, in order to understand why they failed and to disseminate the lessons learned.

Fifth International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2009)


Welcome. AIRWeb is a series of international workshops focusing on Adversarial Information Retrieval on the Web that brings together both researchers and industry practitioners, to present and discuss advances in the state of the art.

Semantic Search Workshop (SemSearch09)


Objectives. In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. However, the representation of user queries and resource content in existing search appliances is still almost exclusively achieved by simple syntax-based descriptions of the resource content and the information need such as in the predominant keyword-centric paradigm (i.e. keyword queries matched against bag-of-words document representation).

On the other hand, recent advances in the field of semantic technologies have resulted in tools and standards that allow for the articulation of domain knowledge in a formal manner at a high level of expressivity. At the same time, semantic repositories and reasoning engines have only now advanced to a state where querying and processing of this knowledge can scale to realistic IR scenarios.

In parallel to these developments, in the past years we have also seen the emergence of important results in adapting ideas from IR to the problem of search in RDF/OWL data, folksonomies, microformat collections or semantically tagged natural text. Common to these scenarios is that the search is focused not on a document collection, but on metadata (which may be possibly linked to or embedded in textual information). Search and ranking in metadata stores is another key topic addressed by the workshop.

As such, semantic technologies are now in a state to provide significant contributions to IR problems. In this context, several challenges arise for Semantic Search systems. These include, among others:

2nd Web People Search Evaluation Workshop (WEPS 2009)


Objectives. Finding information about people in the World Wide Web is one of the most common activities of Internet users. Person names, however, are highly ambiguous. In most cases, the results for a person name search are a mix of pages about different people sharing the same name. The user is then forced either to add terms to the query (probably losing recall and focusing on one single aspect of the person), or to browse every document in order to filter the information about the person he/she is actually looking for. In an ideal system the user would simply type a person name, and receive search results clustered according to the different people sharing that name.

The Web People Search (WePS) Evaluation is a shared evaluation campaign focused on this problem. In this second evaluation 19 research teams from around the world have participated in two tasks: (i) clustering web pages to solve the ambiguity of search results, and (ii) extracting 18 kinds of "attribute values" for target individuals whose names appear on a set of web pages. Participants where provided with task guidelines and development data, and later evaluated their systems on a new testbed.

The WePS-2 Workshop will feature an overview of the lessons learned in this evaluation, talks from the participant teams, a poster session and invited talks from academic and industry speakers.

Content Analysis in the WEB 2.0 (CAW2.0)


Objectives. Web mining deals with understanding, and discovering information in, the World Wide Web. Web mining focuses on analyzing three different sources of information: web structure, user activity and the contents. When referring to the Web 2.0, web structure and user activity related data can be dealt with in a very similar way that in the case of the traditional Web, however, in the case of contents, conventional analysis and mining procedures are not suitable anymore. This is mainly because, in the Web 2.0, contents are generated by users, who make a very free use of language and are constantly incorporating new communication elements which are generally context dependent. This kind of language can also be found on chats, SMS, e-mails and other channels of informal textual communication.

International Cross-Disciplinary Conference on Web Accessibility


Objectives. Population demographics indicate that our populations are ageing across the board. As the population ages the financial requirement to work longer is increased, but the ability to work longer is reduced because disability becomes a bar to employment. With the growth of the knowledge economy, and a move from manual work to more thought-based and communication-based activities, there is the very real possibility of older Web users being able to finding productive, fulfilling, and social empowering employment; if only technology, and specifically the Web, where available to them. An ageing but Web literate population indicates a large market for online shopping and services especially when mobility is a problem for the shopper. In this case we wonder how this new population will interact with Web based resources, and what new problems in accessibility will there be to overcome? Will the Web provide the social, employment, and health care benefits currently unavailable to older users? Will complex and highly graphical interfaces exclude ageing users from access? What problems exist, what are the upcoming problems, what solutions are required? How do the adoption patterns for Web access by older people vary across cultures? How do the adoption patterns for Web access by older people vary across cultures? Finally, what effect will an ageing user population have on the wider Web? We pose the question: "Web accessibility for older users - are we there yet?"

In this case topics of interests include (but are not limited to):


Accessibility for the Modern Web: Design and Evaluation of Accessible Websites

T1-Fullday Christopher Power, Helen Petrie, Andre Freire, Gerhard Weber, David Sloan and Giorgio Brajnik

Abstract. Building accessible sites that are vibrant, dynamic and very usable requires the use of appropriate technologies and the application of good evaluation techniques. This tutorial will present a variety of methods for designing and evaluating accessible websites. Through practical exercises attendees will learn how to conduct automatic, expert and user evaluations.

In the tutorial we cover the area of accessibility by:

Attendees of the tutorial will be able to link accessibility with usability, with an understanding that more accessible websites can provide more usable websites for the wider web audience.

Query Log Mining

T2-Fullday Fabrizio Silvestri and Ricardo Baeza-Yates

Abstract. Web Search Engines have stored in their logs information about users since they started to operate. This information often serves many purposes. The primary focus of this tutorial is to introduce to the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that could be used to extract and to exploit useful knowledge from this (potentially) infinite source of information. Furthermore, participants to this tutorial will be given a unified view on the literature on query log analysis.

Web Search Engine Metrics: Direct Metrics to Measure User Satisfaction

T3-Morning Ali Dasdan, Kostas Tsioutsiouliklis and Emre Velipasaoglu

Abstract. Web search is an indispensible resource for users to search for information on the Web. Web search is also an important service for publishers and advertisers to present their content to users. Thus, user satisfaction is the key and must be quantified. In this tutorial, our goal is to give a practical review of web search metrics from a user satisfaction point of view. This viewpoint is intuitive enough for a typical user to express his or her web search experience using it. The metrics that we are planning to cover fall into the following areas: relevance, coverage, comprehensiveness, diversity, discovery freshness, content freshness, and presentation. We will also describe how these metrics can be mapped to proxy metrics for the stages of a generic search engine pipeline. Our hope is that practitioners can apply these metrics readily and that researchers can get problems to work on, especially in formalizing and refining metrics.

Introduction to Social Computing

T4-Morning Irwin King

Abstract. With the advent of Web 2.0, Social Computing has emerged as one of the hot research topics recently. Social Computing involves the investigation of collective intelligence by using computational techniques such as machine learning, data mining, natural language processing, etc. on social behavior data collected from blogs, wikis, clickthrough data, query logs, tags, etc. from areas such as social networks, social search, social media, social bookmarks, social news, social knowledge sharing, and social games. In this tutorial, we will introduce Social Computing and elaborate on how the various characteristics and aspects are involved in the social platforms for collective intelligence. Moreover, we will discuss the challenging issues involved in Social Computing in the context of theory and models of social networks, mining of spatial and temporal social information, ways to deal with partial and incomplete information in the social context, scalability and algorithmic issues with social computational techniques, and security & privacy issues. Some potential social computing applications for discussion would include collaborative filtering, query log processing, learning to rank, large graph and link algorithms, etc.

Hello Open World - The Web of Data for the Pragmatic Developer

T5-Morning Michael Hausenblas and Alexandre Passant

Abstract. The convergence of the Social Web and the Semantic Web (Web of Data) alike as initiatives as the Linking Open Data community project yield new ways to publish, interlink and re-use information from different distributed data sources on the Web. Hence, people need to understand how to benefit from this new developments and learn about real-world tools and methods to implement applications on the Web of Data for a certain domain. This tutorial will give a step-by-step introduction combined with hands-on session in how to use structured data on theWeb, turn it into RDF, find and reuse existing vocabularies and finally build an application that consumes data from different sources and displays it in a user-friendly way, using a common xAMP environment.

Online advertising: business models, technologies and issues

T6-Afternoon James Shanahan

Abstract. Internet advertising revenues in the United States totaled $21 billion for 2007, up 25 percent versus 2006 revenues of $16.9 billion (according to the Interactive Advertising Bureau). Fueled by these growth rates and the desire to provide added incentives and opportunities for both advertisers and publishers, alternative business models to online advertising are been developed. This tutorial will review the main business models of online advertising including: the pay-per-impression model (CPM); and the pay-per-click model (CPC); a relative new comer, the pay-per-action model (CPA), where an action could be a product purchase, a site visit, a customer lead, or an email signup; and dynamic CPM (dCPM) which optimizes a campaign towards the sites and site sections that perform best for the advertiser.

This tutorial will also discuss in detail the technology being leveraged to automatically target ads within these business models; this largely derives from the fields of machine learning (e.g., logistic regression, online learning), statistical (e.g., binomial maximum likelihood), information retrieval (vector space model, BM25) and economics (auction mechanisms, game theory). Challenges such as click fraud (the spam of online advertising), deception, privacy and other open issues will also be discussed. Web 2.0 applications such as social networks, and video/photo-sharing pose new challenges for online advertising. These will also be discussed.

Learning to Rank for Information Retrieval

T7-Afternoon Tie-Yan Liu

Abstract. An introduction will be given to the new research area, learning to rank. Taking document retrieval as an example, the process of learning to rank is as follows. In learning, a training set of queries and their associated documents (with relevance judgments) are provided, and the ranking model is trained with the data in a supervised fashion, by minimizing certain loss functions. In ranking, the model is applied to new queries and sorts their associated documents. Three major approaches have been proposed in the literature, i.e., the pointwise, pairwise and listwise approaches to learning to rank. The pointwise approach solves the problem of ranking by means of regression or classification on single documents. The pairwise approach transforms ranking to classification on document pairs. The listwise approach tackles the ranking problem directly, by adopting listwise loss functions, or optimizing evaluation measures (e.g., NDCG and MAP). We will introduce the frameworks and representative algorithms of these approaches, and then discuss their advantages, weaknesses, application scopes, and underlying theories. In addition, we will also introduce some other works on learning to rank, including training data creation, semi-supervised ranking, and advanced ranking models in this tutorial.

Interactive Television and the Web

T8-Afternoon Pablo Cesar and Konstantinos Chorianopoulos

Summary. In this tutorial we examine research literature and commercial services and ask two simple questions: What is the scope of social and interactive TV research on the Web? Can it help us reinvent the practices of producing, distributing, and using mass media, such as TV? Rather than defining an over arching framework, we have identified three simple metaphors that have guided interactive TV research up to this day: interactive TV as control of audiovisual content, interactive TV as content creation practice, and, finally, interactive TV as a communication genre, which supports social processes. We propose this taxonomy so that designers of future interactive TV applications on the Web can choose their path of research, with awareness of other directions that have been explored in the past.

From SOA to REST - Designing and Implementing RESTful Services

T9-Fullday Cesare Pautasso and Erik Wilde

Abstract. Recent technology trends in Web services indicate that a solution eliminating the perceived complexity of the WS-* standard technology stack may be in sight: advocates of Representational State Transfer (REST) have come to believe that their ideas explaining why the World Wide Web works are just as applicable to solve enterprise application integration problems and to radically simplify the plumbing required to implement a Service-Oriented Architecture (SOA). In this tutorial we give an introduction to the REST architectural style as the foundation for RESTful Web services. The tutorial starts from the basic design principles of REST and how they are applied to service oriented computing.

Serviceorientation concentrates on identifying self-contained units of functionality, which should then be exposed as easily reusable and repurposable services. This tutorial focuses not on the identification of those units, but on how to design the services representing them. We explain how decisions on the SOA level already shape the architectural style that will be used for the eventual IT architecture, and how the SOA process itself has to be controlled to yield services which can then be implemented RESTfully. We do not claim that REST is the only architectural style that can be used for SOA design, but we do argue that it does have distinct advantages for loosely coupled services and massive scale, and that any SOA approach already has to be specifically RESTful on the business level to yield meaningful input for IT architecture design. To do so, we include concrete examples, showing how to consume RESTful Web services and how to build them (both from a technology and methodology point of view). The tutorial also includes a detailed comparison to the traditional WS-* technology stack, which helps to quantify the *perceived* complexity differences between WS-* and REST. On the one hand, the goal is to help technical decision makers to assess the benefits and limitations of the two technology frameworks objectively and to select the one that best fits their needs. On the other hand, we also aim to highlight some open research problems related to this emerging technology area.

Extracting, Searching and Mining Semantic Annotations on the Web

T10-Fullday Soumen Chakrabarti

Abstract. Automatic, large-scale semantic annotations on unstructuredWeb sources is vital to next-generation search and mining over entities and relations. The last decade has seen substantial research advances starting from the early WebKB vision of developing a probabilistic, symbolic knowledge base mirroring the contents of the Web. WebKB started with relatively modest goals of whole-page classification into students, faculty, department, course, etc. The WebKB proposal was followed by a rich literature on identifying not whole pages, but segments of text token as references to real-world entities from a predetermined set of entity types. The set of types may include persons, places, organizations, dates, paper titles, authors, and conference venues. Progress was also made on the task of identifying if entities of specified types were related in a given way, e.g., author A wrote book B, company C1 acquired company C2, person P joined organization O, etc. The next wave of innovations involved open-domain identification and typing of entity mentions, exploiting Hearst patterns of the form "Jordan and other statisticians" as evidence that some person named Jordan is a statistician. The next step was to identify open-domain binary relations from simple lexical patterns like noun phrase, verb phrase, noun phrase, as in "John Baird invented the television". More recently, efforts have been under way to associate token segments on Web pages with unique entity IDs, such as Wikipedia URNs. This tutorial will give an overview of these recent technologies, with emphasis on the machine learning and data mining techniques involved. Then we will explore how such entity annotations can assist search-free-form and especially semistructured searches.

Online Trust and Reputation Systems

T11-Morning Neel Sundaresan

Abstract. The World Wide Web has evolved from the early days of mainly informational plane to the transactional and conversational plane. eCommerce, Social Networks, and other communicative forms like Blogs, Twitter, and P2P messaging systems are prevalant. The challenges that search engines solved were related to page quality and search relevance. HITS, Pagerank and their derivatives solved these problems effectively. In the new web world the challenges include understanding and modelling beyond web pages of people, transactions, communities, and content. This tutorial will discuss trust and reputation systems as they become prime in this new world. We will give an overview of trust and reputation systems as has been studied by social network scientists and computer scientists. We will discuss online implementations of trust and reputation systems. We will discuss interesting research problems in this area and possible applications and platforms as well.

Rules on the Web

T12-Morning Benjamin Grosof, Mike Dean, Michael Kifer and Raphael Volz

Abstract. Rules is probably the most important frontier area today for the Semantic Web's core technology and standards, and is a main emerging area of the Web overall. Rules extend databases and ontologies with more powerful, flexible, and active forms of "structured" knowledge (as opposed to "unstructured" knowledge such as text), and have a number of close relationships to other aspects of the overall Web such as services, trust, query/search, and collective intelligence. There are a number of exciting research issues, and web rules functionality is being mainstreamed into core industry products such as Oracle's database suite. Recent progress includes fundamental advances in the underlying knowledge representation techniques and in the integration of rules with ontologies and database query/search; major initial industry standards from W3C and OMG nearing finalization; substantive translations between heterogeneous types of commercial rule engines; development of open-source tools for inferencing and interoperability; a wide range of emerging applications including in business, science, and trust; and accelerating industry investments/acquisitions in the technology including by integrated software companies such as Oracle, IBM, and Microsoft. This tutorial will provide a comprehensive and up-to-date introduction to these developments and to the fundamentals of the key technologies involved. It will explore example application scenarios, overall requirements and challenges, and touch upon business/social value and strategy considerations.

Developing Mobile Web Applications and Mobile Widgets

T13-Afternoon Alison Lee, Andreas Girgensohn

Abstract. With the wider availability of standards-based Web browsers on mobile devices such as smart phones like S60 and iPhone, designers, researchers and developers have the option of using the Web technologies and tools to quickly create mobile applications and deploy them to a variety of devices for evaluation and availability. This tutorial will explore and examine the use of mobile AJAX, Web services APIs, Web Runtime, and PyS60 Personal Http Server to create mobile Web applications and mobile widgets. These mobile applications and widgets can mashup data and functionality accessed over the Internet and/or from the mobile device using the same Web programming model that is common place on desktop browsers today.

This tutorial will focus on creating new or mobile versions of Web mashups exclusively for mobile devices. We will highlight differences due to mobility, mobile user experiences, device form factor, new and constrained device features, AJAX on browsers with limitation and data plan pricing considerations. We will discuss how to leverage mobile widgets and mobile Web applications in combination to enhance utility, performance, software maintenance, and needs and experiences of users on the go.

In a full-day version, we will have several time slots for hands-on experience with developing mobile mashups of Internet resources, phone resources as mobile Web applications and mobile widgets. In the half-day version, hands-on experience will be much limited and may be limited to walkthroughs of code fragments. For hands-on experience involving work with mobile devices, the attendees will be able to share the use of mobile devices provided by the instructors for the tutorial.

Detecting, Understanding and Exploiting Web communities

T14-Afternoon Athena Vakali and Ioannis Kompatsiaris

Abstract. Collective user activities on multiple, often heterogeneous and evolving Web sources contributes in the formation of Web communities which are either derived from Web documents/pages, or by users navigational tasks and more recently by tags and social frameworks. Defining, deriving and exploiting communities is not a trivial task since several parameters (large-scale, complexity, evolving information etc) are involved.

This tutorial aims at providing answers for crucial questions raised about communities emerging in the Web and it will initially walk the audience on issues involved towards different community definitions such that then, the problem of community detection (which is well matured and researched in the past) is understood. The tutorial will emphasize and discuss the most important methodologies and techniques which deal with large populations of Web documents participating in vast hyperlinked networks, or networks formed from crawling (part of) the web and more recently, networks reflecting the social relations and/or interactions among people. It is important to understand and categorize community identification efforts by taking into account that different levels of granularity and different views are often used for community identification. Next step of the tutorial will be to point applications and implementations which may be benefited by exploiting Web communities and certain application scenarios will be presented. The emphasis is on the intuition behind all these methodologies and implementations, and on their practical impact for tasks of recommendation, searching, content outsourcing, data administration etc.

W3C Track

Mobile Widgets Camp



  • Mobile Widgets and the Future of the Web, Dan Appelquist (Vodafone)
  • Mobile Widgets, Charles McCathieNevile (Opera Software)
  • Unleashing mobile user-provided services, (Telefonic)
  • Selection of Topics of Discussion for the afternoon sessions (1/2h)

Session 1

Session 2

Including presentation of mobile widgets.

Social Web Camp



  • Social Web at W3C, Harry Halpin, University of Edinburgh
  • FOAF update and planning, Dan Brickley, Vrije University
  • Selection of Topics of Discussion for the afternoon sessions (1/2h)

Session 1

Session 2

Including presentations of semantic Web tools.

Ibero-America Alternate Track

Ibero-America Panels

The New Media Challenge to Traditional Media. Spanish/Portuguese vs. English. How to avoid cultural break down.

Chair: Prof. Virgilio A. F. Almeida, Computer Science Department, Federal University of Minas Gerais (UFMG), Brazil

  • Mario Tascón, General Manager Dixired, lainformació
  • Rosalía Lloret, General Manager Interactive Media,
  • Icaro Moyano. PR Manager Tuenti


Panel on Multilingual Web Sites

Chair: Manuel Tomas Carrasco Benitez, Vice-chair: Luis Bellido

Abstract. Multilingual Web Sites (MWS) refers to sites with multilingual parallel texts; e.g., Europa, the portal of the European Union. For details on this subject, it is recommended reading the (draft) chair memo Open architecture for multilingual web sites. MWS are of great practical relevance: the call for proposals for Multilingual Web has a budget of 14 millions euros; since 2007 and for the first time, the European Commission has a Commissioner for Multilingualism (Mr. Leonard Orban).

There is an urgent need to address the standardization of MWS:

  • Users should expect similar behaviour from different MWS.
  • Webmasters should be capable of creating high quality low cost MWS.

All parties (i.e., panelists, participants and non-participants) are encouraged to comment the chair memo and/or to submit short position papers by emailing to the above address. The intention is to have active participants and not passive listeners. All contributions will be accepted and put online. Of particular interest would be implementations (running code wins). Authors of the most relevant contributions will be invited to do very short presentations. Microphone queue will be available, so participants could intervene at any time. The intention is to have a very interactive event: more like a workshop than a panel.

MWS is a subject with pedigree: in 1996, there was a track on this subject at the Web Internationalization & Multilinguism Symposium in Seville: hopefully, the long march from Seville could finally end in Madrid with a rough consensus to drive forward the standardization process.

Emerging technologies for combining web accessible interfaces with the development of new interoperable ICT devices

Chair: Enrique Varela-Couceiro, Fundación ONCE

  • Julio Abascal, UPV
  • Eduardo Carrasco, VICOMtech
  • Raul Riesco, INTECO
  • Roberto Torena, Technosite
  • Marc Pous, TMT Factory

Abstract. The objective of this panel is to discuss which kind of emerging technologies can be combined with the development of web interfaces to foster accessible and interoperable interaction with a huge amount of devices, that at present are not accessible (such as ATMs, smart houses applications, public kiosks, etc.).

Open Innovation and Open Labs: An emergent opportunity for leading technology innovation and development for the Future Internet

Chair: Martín Álvarez Espinar, W3C Spain Office Manager

  • Antonio Campos, CTIC Foundation
  • Eduardo Álvarez, CTIC Foundation
  • Juan J. Hierro, Chief Technologist, Software Technologies, Telefónica I+D
  • Jesús M. González-Barahona, Universidad Rey Juan Carlos

Abstract. . Open Labs are places where science and open innovation meet to develop, research and explore new ideas. In the panel session different Open Labs, where people, industry (both large enterprises and SMEs), and academia can work jointly together and bring added value for the Future Internet research, will be presented. This panel will controversially discuss the risks as well as the potentials and oportunities associated to this approach to innovation and technology development, with the aim of bringing this debate closer to the heart of the Iberoamerican WWW community.

Ibero-America Papers

Track Opening: Dr. Ricardo Baeza-Yates, Yahoo! Research
IberoAmerica-2 Dr. Javier Soriano, Universidad Politécnica de Madrid, Spain

Analyzing Seller Practices in a Brazilian Marketplace

IberoAmerica-21 Regular Track paper Adriano Pereira, Diego Duarte, Wagner Meira Jr., Virgilio Almeida and Paulo Goes

Abstract. E-commerce is growing at an exponential rate. In the last decade, there has been an explosion of online commercial activity enabled by World Wide Web (WWW). These days, many consumers are less attracted to online auctions, preferring to buy merchandise quickly using fixed-price negotiations. Sales at, the leader in online sales of fixed-price goods, rose 37% in the first quarter of 2008. At eBay, where auctions make up 58% of the site's sales, revenue rose 14%. In Brazil, probably by cultural influence, online auctions are not been popular. This work presents a characterization and analysis of fixed-price online negotiations. Using actual data from a Brazilian marketplace, we analyze seller practices, considering seller profiles and strategies. We show that different sellers adopt strategies according to their interests, abilities and experience. Moreover, we confirm that choosing a selling strategy is not simple, since it is important to consider the seller's characteristics to evaluate the applicability of a strategy. The work also provides a comparative analysis of some selling practices in Brazil with popular worldwide marketplaces.

Analysis of Fraudulent Activity in a Brazilian Auction Site

IberoAmerica-22 Alternate Track paper Vinicius Almendra and Daniel Schwabe

Abstract. Online auction fraud is the most common complaint according to Internet Crime and Complaint Center. Despite that, there are not many published empirical studies about fraud occurrence in online auction sites, and the existing ones mostly target eBay. This lack of research is even worse in Latin American countries. In this paper we present the results of an exploratory research on fraud occurrence in MercadoLivre, the biggest Brazilian auction site, which suggest that non-delivery fraud is a real, recurrent and measurable problem.

Analyzing and ranking the Spanish speaking MySpace community by their contributions in forums

IberoAmerica-23 Alternate Track paper Andreas Kaltenbrunner, Erika Bondia and Rafael Banchs

Abstract. This study analyzes the MySpace forums in Spanish language, which permits to extract otherwise restricted demographic data from the social network and leads to a detailed description of the Spanish speaking community in MySpace. Important differences in age and gender structures are found for users with different countries of origin. The distribution of activity among different threads and users is very heterogeneous and follows heavy tailed distributions. We furthermore propose two variants of the h-index which allow the ranking of users and threads by their importance in the forums.

Ibero-America Papers

IberoAmerica-4 Dr. Daniel Schwabe, Universidad PUC-Rio, Brazil

A Geographical Analysis of Knowledge Production in Computer Science

IberoAmerica-41 Regular Track paper Guilherme Vale Menezes, Nivio Ziviani, Alberto Laender and Virgilio Almeida.

Abstract. The goal of this paper is to analyze coauthorship networks in the Computer Science community. For this, we considered 30 graduate programs in Computer Science in different regions of the world, being 8 programs in Brazil, 16 in North America (3 in Canada and 13 in the United States) and 6 in Europe (2 in France, 1 in Switzerland and 3 in the United Kingdom). We collected data about 176,537 authors and 352,766 publication entries distributed among 2,176 publication venues. The results obtained from different measurements over the social networks show differences in the publication profile of Brazilian, European and North-American programs. For instance, the size of the giant component indicates the existence of isolated collaboration groups inside European programs, contrasting with their Brazilian and North-American counterparts. We also analyzed the temporal evolution of the social networks representing the three different regions. We observed that the number of collaborations grows faster than the number of authors, benefiting from the existing structure of the network. The temporal evolution also shows differences between well-established fields, such as Databases and Computer Architecture, and emerging fields, such as Bioinformatics and Geoinformatics. We also observed an increase on the average number of authors per paper for the three networks.

Portuguese Language Processing Service

IberoAmerica-42 Alternate Track paper Eraldo R. Fernandes, Ruy L. Milidiu and Cicero N. Santos

Abstract. Current Natural Language Processing tools provide shallow semantics for textual data. These kind of knowledge could be used in the Semantic Web. In this paper, we describe F-EXT-WS, a Portuguese Language Processing Service that is now available at the Web. The first version of this service provides Part-of-Speech Tagging, Noun Phrase Chunking and Named Entity Recognition. All these tools were built with the Entropy Guided Transformation Learning algorithm, a state-of-the-art Machine Learning algorithm for such tasks. We show the service architecture and interface. We also report on some experiments to evaluate the system's performance. The service is fast and reliable.

The Morfeo Open Source Community: building technologies of the Future Web through open innovation

IberoAmerica-43 Alternate Track paper David Lizcano, Miguel Jiménez, Javier Soriano, Juan J. Hierro and Andrés Leonardo Martínez.

Abstract. Nowadays enterprise collaboration is becoming essential for valuable innovation and competitive advantage. This collaboration has to be brought a step forward from technical collaboration till collective smart exploitation of global intelligence. Morfeo Open Source Community implements it as an open innovation schema of collaboration among SME's, universities, research centres, public administration and major corporations. Its disruptive contribution to the state of the art is to manage the collaboration licensing the shared technology as open, with an OSI accepted license. The main results of this innovation model are the development of SOA open software standards, the increase of OSS quality, the creation of an OSS services provider ecosystem, with an important impact in the local economical development and the dissemination of the R&D&i activities among a broader set of organizations. Now Morfeo, with established relationship in Spain and EU, is involved defining new cooperation schemas in Ibero-America. The aim is to define a network of Morfeo Office in several countries to coordinate the global community.

Ibero-America Papers

IberoAmerica-6 Dr. Jesús Gonzalez-Barahona, Universidad Rey Juan Carlos, Spain

Inquiro.CL: a New Search Engine in Chile

IberoAmerica-61 Alternate Track paper Marcelo Mendoza, Hipolito Guerrero and Julio Farias

Abstract. In this paper we present a new online search engine developed in Chile: Inquiro.CL. This new search engine lets users search the Latin-American web (speciffically, ten Spanish- speaking countries) by specifying their search domain (.cl,,, and the rest of Latin America). The structure is based on a distributed architecture that manages two document collections. The first is a cache of pages / sites that provides responses at low computational cost. The second is a general collection of documents that is visited when the user extends the search to the second results page or beyond. The index has been built using a multi-tier strategy, such that titles, URLs, headers, and complete site contents that make up the collection are cataloged in different indexes. The search engine uses a ranking function that combines various relevance measurements, among them user preferences, page rank, and text. Currently Inquiro's main collection reaches more than 1,500,000 pages and approximately 35,000 sites. Experimental results show that the search engine is precise and compares favorably with similar search engines, reaching an average of over 60% precision in the top 5 rankings.

Finding Answers to Definition Questions across the Spanish Web

IberoAmerica-62 Alternate Track paper Alejandro Figueroa

Abstract. In the last two years, we have been developing a multilingual web question answering system, which is aimed at discovering answers to natural language questions in three different languages: English, German and Spanish. One of its ma jor components is the module that extracts answers to definition questions from the web. This paper compares and provides insights into different techniques that we have used during these two years for tackling definition questions in Spanish. Additionally, the present work focuses its attention on new challenging issues regarding the language phenomena that adversely affect the answering process. In particular, in comparison with the analogous task in English.

Overcoming database heterogeneity to facilitate social networks: the Colombian displaced population as a case study

IberoAmerica-63 Alternate Track paper Juan F. Sequeda, Alexander Garcia, Oscar Corcho, Syed Hamid Tirmizi and Daniel Miranker.

Abstract. In this paper we describe a two-step approach for the publication of data about displaced people in Colombia, whose lack of homogeneity represents a major barrier for the application of adequate policies. This data is available in heterogeneous data sources, mainly relational, and is not connected to social networking sites. Our approach consists in a first step where ontologies are automatically derived from existing relational databases, exploiting the semantics underlying the SQL-DDL schema description, and a second step where these ontologies are aligned with existing ontologies (FOAF in our example), facilitating a better integration of data coming from multiple sources.


Request a BOF.


Total poster papers submitted (including the ones resubmitted from the research track): 225
Total accepted: 93
Total number of poster papers submitted from recommended research papers: 33
Total accepted: 29
Total number of regular poster papers: 192
Total accepted: 60
Acceptance rate for the regular poster papers: 31%

Posters 1


Posters 2


Posters 3



Invited Talk Session


Welcome to Day 1

Devel-11 Cong Yu and Raoul-Sam Daruwala


Socializing Big Data

Devel-12 Jeff Hammerbacher, Cloudera


WebNC: efficient sharing of web applications

Devel-13 15 minutes Laurent Denoue, Scott Carter, John Adcock and Gene Golovchinsky, FXPal

Abstract. WebNC is a browser plugin that leverages the Document Object Model for efficiently sharing web browser windows or recording web browsing sessions to be replayed later. Unlike existing screen-sharing or screencasting tools, WebNC is optimized to work with web pages where a lot of scrolling happens. Rendered pages are captured as image tiles, and transmitted to a central server through http post. Viewers can watch the webcasts in real-time or asynchronously using a standard web browser: WebNC only relies on html and javascript to reproduce the captured web content. Along with the visual content of web pages, WebNC also captures their layout and textual content for later retrieval. The resulting webcasts require very little bandwidth, are viewable on any modern web browser including the iPhone and Android phones, and are searchable by keyword.

Development I (New Programming Environments)


Will an easy to use language attract content experts?

Devel-21 30 minutes Kevin Miller, Revolution

Abstract. We propose a new web plugin, enabling content creators to step easily to the next level of producing compelling web applications from scratch without needing to learn obscure syntax and concepts. The plugin is based on the well established Revolution desktop development system, an English-like scripting language descended from the HyperCard tradition. The same language will run on the server side and on the client side directly embedded in HTML and executed as the page is sent or loaded. We will demonstrate how to build a web application using a simple simulation complete with multimedia, integrated information and embedded Revolution tags. We will include details of the public beta program.

Intrusive unit testing for Web applications

Devel-22 30 minutes Philippe Poulard, Reflex, INRIA

Abstract. Several tools have been designed for automating tests on Web applications. They usually drive browsers the same way people do: they click links, fill in forms, press buttons, and they check results, such as whether an expected text appears on the page.

WUnit is such a tool, but goes a step further: it can act inside the server, examine server-side components, and even modify them, which gives more controls to the tests to write. For example, in our tests we can get the user session server-side, store an arbitrary object (that does have sense in our test) in the user session, and get the page that renders it. Cheating like this allows to bypass the normal functioning of a Web application and really perform independent unit tests, which gives more flexibility to test-driven development.

In this session, we'll present the Active Tags framework from which is derived WUnit, and experiment how to design simply a test suite for an AJAX-based application: how to get a page, fill a form, upload files, and of course how to act on server-sides components.

Creating Personal Mobile Widgets without Programming

Devel-23 30 minutes Geetha Manjunath, Thara S, Hitesh Bosamiya, Santhi Guntupalli, Vinay Kumar and Ragu Raman G., HP

Abstract. Our goal is to radically simplify the web so that end-users can perform their personal tasks with just a single click. For this, we define a concept called TaskLet to represent a task-based personal interaction pattern and propose a platform for automatically creating, sharing & executing them. TaskLets can be created even by a naive web user without programming knowledge, as we use the technique of Programming-By-Demonstration. TaskLets can be deployed on the client, cloud or on telecom provider network - enabling intuitive web interaction through widgets, thin mobile browsers as well as from mobile phones via SMS and Voice. Our key innovation is a tool and platform to enable end-users to simplify their personally valuable task. We wish to share the proposed tool with both expert web developers and naïve users of WWW to promote a wider developer community for mobile web applications.

Mashing Up Solutions to Old Problems


The Future of Vertical Search Engines with Yahoo! Boss

Devel-31 30 minutes Ted Drake, Yahoo! France

Abstract. While general search engines, such as Google, Yahoo!, and Ask, dominate the search industry; there is a new batch of vertical, niche search engines that could fundamentally change the behavior of search. These search engines are built on the open API's of Yahoo, Google, and other major players. However, Yahoo's recently released BOSS API has made these engines more powerful, more specialized, and easier to build and maintain.

These specialized search engines are finding the quality information in the haystack of general search terms. The general search engines, in turn, surface the niche results pages. This talk will discuss how to use the Yahoo! Boss search API to create a niche search engine. It will also show how this information can be mashed together with other APIs and finally how the results pages begin appearing in the general search engines. The talk is appropriate for engineers, entrepreneurs, and managers.

CentMail: Rate Limiting via Certified Micro-Donations

Devel-32 30 minutes Sharad Goel, Jake Hofman, John Langford, David Pennock and Daniel Reeves, Yahoo! Inc.

Abstract. We present a plausible path toward adoption of email postage stamps--an oft-cited method for fighting spam--along with a protocol and a prototype implementation. In the standard approach, neither senders nor recipients gain by joining unilaterally, and senders lose money. Our system, called CentMail, begins as a charity fund-raising tool: Users donate $0.01 to a charity of their choice for each email they send. The user benefits by helping a cause, promoting it to friends, and potentially attracting matching donations, often at no additional cost beyond what they planned to donate anyway. Charitable organizations benefit and so may appeal to their members to join. The sender's email client inserts a uniquely generated CentMail stamp into each message. The recipient's email client verifies with CentMail that the stamp is valid for that specific message and has not been queried by an unexpectedly large number of other recipients. More generally, the system can serve to rate-limit and validate many types of transactions, broadly construed, from weblog comments to web links to account creation.

Combining multi-level audio descriptors via web identification and aggregation

Devel-33 30 minutes Jun Wang, Xavier Amatriain and David Garcia Garzon. Chinese Academy of Sciences, Telefonica Research, David Garcia Garzon

Abstract. In this paper, we present the CLAM Aggregator tool. It offers a convenient GUI to combine multi-level audio descriptors. A reliable method is embedded in the tool to identify users' local music collection with the open data resources. In the context of CLAM framework and Annotator application, Aggregator allows users to configure, aggregate and edit music information ranging from low-level frame scales, to segment scales, and further to any metadata from the outside world such as semantic web. All these steps are designed in a flexible, graphical and user-defined way.

Invited Talk Session


Welcome to Day 2

Devel-41 Cong Yu and Raoul-Sam Daruwala


Making social networks more useful: Aardvark Social Search



The Web, Smart and Fast

Devel-43 15 minutes Olivier Rossel, Datao

Abstract. The Web is considered literature by many users, who cannot spend that much time browsing and reading textual content. This issue is even bigger when dealing with Web 2.0 sites, where the user is more interested in data than textual content.

Considering the expectations of people for instant access to data on the Web, we propose the paradigm of the Smart & Fast Web. We discuss the concept, the expectations, the technologies. At last, we point to, a project for a semantic web browser, to surf the Web of Data with all the requirements for a Smart & Fast Web.

Development II (Performance and Scalability)


Web data processing with MapReduce and Hadoop

Devel-51 45 minutes Tom White and Christophe Bisciglia, Cloudera

Abstract. With massive growth in website traffic, extracting valuable information from clickstreams is a challenge as existing tools struggle to scale with web scale data. Apache Hadoop is a system for storing and processing massive amounts of data in parallel on clusters of commodity machines. With Hadoop and MapReduce it becomes feasible to make ad hoc queries over the massive datasets, opening up new possibilities for unearthing insights in web scale data.

This talk will consist of two parts. The first part will be a brief introduction to MapReduce and Hive, Hadoop's processing and data warehousing components, and will explain how these technologies are designed to handle big data. The second part will be a demo, showing how Hadoop can be used in practice to mine web logs.

Jeff Hammerbacher is giving a keynote, and was asked to have Cloudera also submit a more technical talk for this developer track. Tom White will be joining Jeff in Madrid. Christophe Bisciglia manages our conference schedule, and does not need to be cited for this submission. Only Tom will be speaking for this talk.

Bootstrapping Web Pages for Accessibility and Performance

Devel-52 15 minutes Clint Hall, Cerner Corporation

Abstract. In this talk, I present a technique called Web Bootstrapping, a process by which an accurate collection of only those static resources and metadata necessary for a unique experience be delivered passively, by the most performant means possible. In further contrast to existing methodologies, rather than focus on the web client's identity or version, this approach determines resources based on capability, form factor and platform by collecting the often-immutable attributes of the client. Bootstrapping allows for rule-based, externalized, server-side configuration, further promoting progressive enhancement and client performance.

YUI 3: Faster, Lighter, Easier

Devel-53 30 minutes Matt Sweeney, Yahoo! Inc.

Abstract. The Yahoo! User Interface Library has been widely adopted by the mainstream web development community, and is used to power websites worldwide.

YUI 3 is the next generation of YUI, tailored to the feedback that we have received from thousands of users and developers.

This talk will give an overview of YUI 3, including what's new, what's changed, and where we are going from here.

Semantic Web


DBpedia - A Linked Data Hub and Data Source for Web Applications and Enterprises

Devel-61 45 minutes Georgi Kobilarov, Christian Bizer, Sören Auer and Jens Lehmann, Freie Universität Berlin, Universität Leipzig

Abstract. The DBpedia project provides Linked Data identifiers for currently 2.6 million things and serves a large knowledge base of structured information. DBpedia developed into the central interlinking hub for the Linking Open Data project, its URIs are used within named entity recognition services such as OpenCalais and annotation services such as Faviki, and the BBC started using DBpedia as their central semantic backbone. DBpedia's structured data serves as background information in the process interlinking datasets and provides a rich source of information for application developers. Beside making the DBpedia knowledge base available as linked data and RDF dumps, we offer a Lookup Service which can be used by applications to discover URIs for identifying concepts, and a SPARQL endpoint that can be retrieve data from the DBpedia knowledge base to be used in applications. This talk will give an introduction to DBpedia for web developers and an overview of DBpedia's development over the last year. We will demonstrate how DBpedia URIs are used for document annotation and how Web applications can via DBpedia facilitate Wikipedia as a source of structured knowledge.

A Semantic Web Ready Service Language for Large-Scale Earth Science Archives

Devel-62 15 minutes Peter Baumann, Jacobs University Bremen

Abstract. Geo data classically are categorized into vector, raster, and meta data. The latter category receives plentiful attention by the Web research community; vector data are considered to some extent, but raster data, contributing the largest data volumes, are neglected hitherto. Hence, today only on metadata level service offerings can only be consumed in a semantically adequate manner, while requests addressing the contents of raster sets cannot be done at all or only through APIs. In the end, they lack automatic contents discovery, service chaining, as well as flexible retrieval and analysis. The Open GeoSpatial Consortium (OGC) Web Coverage Processing Service (WCPS) standard defines a raster retrieval language suitable for ad-hoc navigation, extraction, and analysis of large, multi-dimensional data sets. Due to its formalized semantics definition it is suitable for reasoning and automatic service chaining. Based on real-life use cases drawn from remote sensing, oceanography, geophysics, and climate modeling we discuss the language concept, design rationales, and features like discoverability, declarativeness, safe evaluation, and optimizability. Further, the reference implementation stack, which will be released in open source, is detailed.

A new tool to improve the filtering options in advanced searching

Devel-63 30 minutes Fernando Moreno-Torres, MTC Soft

Abstract. We have developed a software application that analyzes in detail a text in English and labels the text with linguistic attributes plus additional information in a fully automatic way. With this new texts, a search engine is able to index the information in a way that provides new filtering possibilities for advanced searches.



Query Portals

Devel-71 30 minutes Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Gantai, Arnd Konig and Dong Xin, Microsoft Research

Abstract. Our goal is to enable users to efficiently and effectively search the web for informational queries and browse the content relevant to their queries. We achieve a unique "portal" like functionality for each query by effectively exploiting structured and unstructured content. We exploit existing structured data to identify and return per query a set of highly relevant entities such as people, products, movies, locations. Further, we are able to return additional information about the retrieved entities, such as categories, refined queries, and web sites which provide detailed information for each entity. The combination of search results and structured data creates a rich set of results, for the user to focus on and refine their search.

Towards Distributed Social Search Engines

Devel-72 30 minutes Josep M. Pujol and Pablo Rodriguez, Telefonica Research

Abstract. We describe a distributed social search engine build upon open-source tools aiming to help the community to {\em take back the Search}. Access to the Web is universal and open, and so the mechanisms to search should be. We envision search as a basic service whose operation is controlled and maintained by the community itself. To that end we present an alpha version of what could become the platform of a distributed search engine fueled by the shared resources and collaboration of the community.

Integration at Web-Scale: Scalable Agent Technology for Enabling Structured Vertical Search

Devel-73 30 minutes Govind Kabra and Kevin Chang

Abstract. The Web today has "everything": Every object of interest in real world is starting to find its presence in the online World. As such, the search needs of users are getting increasingly sophisticated. How do you search for apartments? How do you find products to buy? Traditional paradigm of Web search, starting from keyword input, and ending in Web pages as output, stifles users---requiring intensive manual post-processing of search results.s the solution, we present a platform for enabling vertical search. At the core is our novel agent technology for structured crawling. We showcase the promise of our platform using two concrete products that clearly demonstrate the possibility of Integration at Web-Scale.

The talk will show demos of two concrete products (apartment search, shopping search) and key underlying technologies.

The slides are attached in pdf format. The power-point version are online (for better animation):

Lightning Talks and Demonstations


Query GeoParser: A Spatial-Keyword Query Parser Using Regular Expressions

Devel-81 15 minutes Jason Hines and Tony Abou-Assaleh, GenieKnows

Abstract. There has been a growing commercial interest in local information within Geographic Information Retrieval, or GIR, systems. Local search engines enable the user to search for entities that contain both textual and spatial information, such as Web pages containing addresses or a business directory. Thus, queries to these systems may contain both spatial and textual components-spatial-keyword queries. Parsing the queries requires breaking the query into textual keywords, and identifying components of the geo-spatial description. For example, the query 'Hotels near 1567 Argyle St, Halifax, NS' could be parsed as having the keyword 'Hotels', the preposition 'near', the street number '1567', the street name 'Argyle', the street suffix 'St', the city 'Halifax', and the province 'NS'. Developing an accurate query parser is essential to providing relevant search results. Such a query parser can also be utilized in extracting geographic information from Web pages.
One approach to developing such a parser is to use regular expressions. Our Query GeoParser is a simple, but powerful, regular expression-based spatial-keyword query parser. Query GeoParser is implemented in Perl and utilizes many of Perl's capabilities in optimizing regular expressions. By starting with regular expression building blocks for common entities such as number and streets, and combining them into larger regular expressions, we are able handle over 400 different cases while keeping the code manageable and easy to maintain. We employ the mark-and-match technique to improve the parsing efficiency. First we mark numbers, city names, and states. Following, we use matching to extract keywords and geographic entities. The advantages of our approach include manageability, performance, and easy exception handling. Drawbacks include a lack of geographic hierarchy and the inherent difficulty in dealing with misspellings. We comment on our overall experience using such a parser in a production environment, what we have learnt, and suggest possible ways to deal with the drawbacks.

A Virtual Oceanographic Data Center

Devel-82 15 minutes Sean McCleese, Chris Mattmann, Rob Raskin, Dan Crichton and Sean Hardman. NASA JPL / USC

Abstract. Oceanographic datacenters at the National Aeronautics and Space Administration (NASA) are geographically sparse and disparate in their technological strengths and standardization on common Internet data formats. Virtualized search across these national assets would significantly benefit the oceans research community. To date, the lack of common software infrastructure and open APIs to access the data and descriptive metadata available at each site has precluded virtualized search. In this paper, we describe a nascent effort, called the Virtual Oceanographic Data Center, or VODC, whose goals are to overcome the challenge of virtualized search and data access across oceanographic data centers, and to provide a common and reusable capability for increasing access to large catalogs of NASA oceanographic data.

Creating Your Own Web-Deployed Street Map Using Open Source Software and Free Data

Devel-83 15 minutes Christopher Adams and Tony Abou-Assaleh, GenieKnows

Abstract. Street maps are a key element to Local Search; they make the connection between the search results, and the geography. Adding a map to your website can be easily done, using an API from a popular local search provider. However, the lists of restrictions are lengthy and customization can be costly, or impossible. It is possible to create a fully customizable web-deployed street map without sponsoring the corporate leviathans, at only the cost of your time and your server. Being able to freely style and customize your map is essential; it will distinguish your website from websites with shrink wrapped maps that everyone has seen. Using open source software adds to the level of customizability - you will not have to wait two years for the next release and then maybe get the anticipated new feature or the bug fix; you can make the change yourself. Using free data rids you of contracts, costly transactions, and hefty startup fees. As an example, we walk through creating a street map for the United States of America.
A Web-deployed street map consists of a server and a client. The server stores the map data including any custom refinements. The client requests a portion of the map and the server renders that portion and returns it to the client, which in turn displays it to the user. The map data used in this example is the Tiger/LINE data. Tiger/LINE data covers the whole of the USA. Another source of free road network data is OpenStreetMap, which is not as complete as Tiger/LINE but includes additional data such as points of interest and streets for other countries. Sometimes the original data is not formatted in a manner that attributes to a good looking, concise map. In such cases, data refinement is desired. For instance, performance and aesthetics of a map can be improved by transforming the street center lines to street polygons. For this task, we use the Python language, which has many extensions that make map data refinement easy. The rendering application employed is MapServer. MapServer allows you to specify a configuration file for your map, which consists of layers referencing geographical information, as well as the style attributes to specify how the layers are visualized. MapServer contains utilities to speed up the rendering process, and organize similar data. On the front end, we need a web-page embeddable client that can process requests for map movements, and scale changes in real time. In our experience, OpenLayers is this best tool for this task; it supports many existing protocols for requesting map tiles and is fast, customizable, and user friendly. Thus, deploying a street map service on the Web is feasible for individuals and not limited to big corporations.

Towards a Semantic Web Environment for XBRL

Devel-84 15 minutes Sheila Méndez Núñez, Jose Emilio Labra Gayo and Javier De Andrés. University of Oviedo

Abstract. BRL is an emerging standard that allows enterprises to present their financial accounting information in a XML format, according to their applicable legislation. In this paper we present a system which offer the possibility of adding semantic annotations contained in a financial ontology to XBRL reports. Furthermore, we present a new approach that will apply probabilistic methods to determine the degree of similarity between concepts of different XBRL taxonomies. . This approach will allow invertors, enterprises and XBRL users in general, to perform comparisons between reports that fit onto different taxonomies, even belonging to different countries.

Using the Web as our Content Management System on the BBC Music Beta

Devel-85 15 minutes Patrick Sinclair, Nicholas Humfrey, Yves Raimond, Tom Scott and Michael Smethurst. BBC Audio and Music Interactive

Abstract. In this paper, we describe the BBC Music Beta, providing a comprehensive guide to music content across the BBC. We publish a persistent web identifier for each resource in our music domain, which serves as an aggregation point for all information about it. We describe a promising approach in building web sites, by re-using structured data available elsewhere on the Web --- the Web becomes our Content Management System. We therefore ensure that the BBC Music Beta is a truly Semantic Web site, re-using data from a variety of places and publishing its data in a variety of formats.

Improving interaction via screen reader using ARIA: An example

Devel-86 15 minutes Marina Buzzi, Maria Claudia Buzzi, Barbara Leporini and Caterina Senette. National Research Council, Pisa, Italy

Abstract. An interface that conforms to W3C ARIA (Accessible Rich Internet Applications) suite would overcome accessibility and usability problems that prevent disabled users from actively contributing to the collaborative growth of knowledge. In a previous phase of our study we first identified problems of interaction via screen reader with Wikipedia [2], then proposed an ARIA-based modified Wikipedia editing page [1]. At this stage we only focused on the main content for editing/formatting purposes (using roles). To evaluate the effectiveness of an ARIA-based formatting toolbar we only dealt with the main content of the editing page, not the navigation and footer sections. The next step using ARIA is to introduce landmarks (regions) and use the "flowto" property to be able to change the order of how page content is announced via screen reader. In this way the new UI becomes functionally equivalent to the original Wikipedia editing page and its appearance is very similar (apart from an additional combobox instead of a list of links), but usability is greatly enhanced. In this demo we will show the interaction via Jaws screen reader using both the original and the proposed Wikipedia editing pages.

Data and the Cloud


A web-based rights management system for developing trusted value networks

Devel-91 30 minutes Víctor Torres, Jaime Delgado, Xavier Maroñas, Silvia Llorente and Marc Gauvin, Universitat Pompeu Fabra, Universitat Politècnica de Catalunya, NetPortedItems S.L.

Abstract. We present an innovative architecture that enables the digital representation of original works and derivatives while implementing Digital Rights Management (DRM) features. The architecture's main focus is on promoting trust within the multimedia content value networks rather than solely on content access and protection control. The system combines different features common in DRM systems such as licensing, content protection, authorization and reporting together with innovative concepts, such as the linkage of original and derived content and the definition of potential rights. The transmission of reporting requests across the content value network combined with the possibility for authors to preserve rights over derivative works enables the system to distribute income amongst all the actors involved in different steps of the creation and distribution chain. The implementation consists of a web application which interacts with different external services plus a desktop user application used to render protected content. It is currently publicly accessible for evaluation.

Silk - A Link Discovery Framework for the Web of Data

Devel-92 30 minutes Christian Bizer, Julius Volz and Georgi Kobilarov, Freie Universität Berlin, Chemnitz University of Technology

Abstract. The Web of Linked Data is built upon two simple ideas: First, to employ the RDF data model to publish structured data on the Web. Second, to set explicit RDF links between data items within different data sources.
The Silk link discovery framework supports developers in accomplishing the second task. Using a declarative language, developers can specify which types of RDF links are set between data items and which conditions data items must fulfil in order to be interlinked. Link conditions can combine various similarity metrics, aggregation and transformation functions. Link conditions may take the graph around a data items into account, which is addressed using an RDF path language. Silk accesses remote data sources via the SPARQL protocol and can be employed in situations where terms from different vocabularies are mixed and where no consistent RDFS or OWL schemata exist.
The talk gives an overview about the Silk framework and explains how developers can use Silk to interlink their data with the data cloud that is published by the W3C Linking Open Data community effort.

A REST Architecture for Social Disaster Management

Devel-93 30 minutes Julio Camarero and Carlos A. Iglesias. E.T.S.I. Telecomunicación (UPM), Ciudad Universitaria

Abstract. his article presents a social approach for disaster management, based on a public portal, so-called Disasters2.0, which provides facilities for integrating and sharing user generated information about disasters. The architecture of Disasters2.0 is designed following REST principles and integrates external mashups, such as Google Maps. This architecture has been integrated with different clients, including a mobile client, a multiagent system for assisting in the decentralized management of disasters, and an expert system for automatic assignment of resources to disasters. As a result, the platform allows seamless collaboration of humans and intelligent agents, and provides a novel web2.0 approach for multiagent and disaster management research and artificial intelligence teaching.

Refereed Papers

Click Models

Mining-1 Deepayan Chakrabarti

A Dynamic Bayesian Network Click Model for Web Search Ranking

647 Olivier Chapelle and Ya Zhang

Abstract. As with any application of machine learning, web search ranking requires labeled data. The labels usually come in the form of relevance assessments made by editors. Click logs can also provide an important source of implicit feedback and can be used as a cheap proxy for editorial labels. The main difficulty however comes from the so called position bias - urls appearing in lower positions are less likely to be clicked even if they are relevant. In this paper, we propose a Dynamic Bayesian Network which aims at providing us with unbiased estimation of the relevance from the click logs. Experiments show that the proposed click model outperforms other existing click models in predicting both click-through rate and relevance.

Click Chain Model in Web Search

836 Fan Guo, Chao Liu, Tom Minka, Yi-Min Wang and Christos Faloutsos

Abstract. Given a terabyte click log, can we build an efficient and effective click model? It is commonly believed that web search click logs are a gold mine for search business, because they reflect users' preference over web documents presented by the search engine. Click models provide a principled approach to inferring user-perceived relevance of web documents, which can be leveraged in numerous applications in search businesses. Due to the huge volume of click data, scalability is a must.
We present the click chain model (CCM), which is based on a solid, Bayesian framework. It is both scalable and incremental, perfectly meeting the computational challenges imposed by the voluminous click logs that constantly grow. We conduct a reproducible experimental study on a data set containing 8.8 million query sessions obtained in July 2008 from a commercial search engine. CCM consistently outperforms two state-of-the-art competitors in a number of metrics, with over 12% better log-likelihood, more than 6% improvement in perplexity and 10% improvement in the prediction quality of the first and the last clicked position.

Spatio-Temporal Models for Estimating Click-through Rate

629 Deepak Agarwal, Bee-Chung Chen and Pradheep Elango

Abstract. We propose novel spatio-temporal models to estimate click-through rates in the context of content recommendation. We track article CTR at a fixed location over time through a dynamic Gamma-Poisson model and combine information from correlated locations through dynamic linear regressions, significantly improving on per-location model. Our models adjust for user fatigue through an exponential tilt to the first-view CTR (probability of click on first article exposure) that is based only on user-specific repeat-exposure features. We illustrate our approach on data obtained from a module (Today Module) published regularly on Yahoo! Front Page and demonstrate significant improvement over commonly used baseline methods. Large scale simulation experiments to study the performance of our models under different scenarios provide encouraging results. Throughout, all modeling assumptions are validated via rigorous exploratory data analysis.

Recommender Systems

Social-1 Aya Soffer

Tagommenders: Connecting Users to Items through Tags

233 Shilad Sen, Jesse Vig and John Riedl

Abstract. Tags have emerged as a powerful mechanism that enable users to find, organize, and understand online entities. Recommender systems similarly enable users to efficiently navigate vast collections of items. Algorithms combining tags with recommenders may deliver both the automation inherent in recommenders, and the flexibility and conceptual comprehensibility inherent in tagging systems. In this paper we explore tagommenders, recommender algorithms that predict users' preferences for items based on their inferred preferences for tags. We describe tag preference inference algorithms based on users' interactions with tags and movies, and evaluate these algorithms based on tag preference ratings collected from 995 MovieLens users. We design and evaluate algorithms that predict users' ratings for movies based on their inferred tag preferences. Our research promises to increase the accuracy of recommendations in tagging systems, and it may lead to flexible recommender systems that leverage the characteristics of items users find most important.

Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior

365 Wen-Yen Chen, Jon-Chyuan Chu, Junyi Luan, Hongjie Bai and Edward Chang

Abstract. Users of social networking services can connect with each other by forming communities for online interaction. Yet as the number of communities hosted by such websites grows over time, users have even greater need for effective community recommendations in order to meet more users. In this paper, we investigate two algorithms from very different domains and evaluate their effectiveness for personalized community recommendation. First is Association Rule Mining (ARM), which discovers associations between sets of communities that are shared across many users. Second is Latent Dirichlet Allocation (LDA), which models user-community co-occurrences using latent aspects. In comparing LDA with ARM, we are interested in discovering whether modeling latent structure is more effective for recommendations versus directly mining rules from the observed data. We experiment on an Orkut data set consisting of 492,104 users and 118,002 communities. We show that LDA consistently performs better than ARM using the top-k recommendations ranking metric, and we analyze examples of the latent information learned by LDA to explain this finding. To efficiently handle the large-scale data set, we parallelize LDA on distributed computers and demonstrate our parallel implementation's scalability with varying numbers of machines.

Personalized Recommendation on Dynamic Contents Using Predictive Bilinear Models

713 Wei Chu and Seung-Taek Park

Abstract. In Web-based services of dynamic contents (such as news articles), recommender systems always face the difficulties of timely identifying new items of high-quality and providing recommendations for new users. We propose a feature-based machine learning approach to personalized recommendation that is capable of handling the cold-start issues effectively. We maintain profiles of contents, in which temporal characteristics of contents, e.g. popularity or freshness, are updated in real-time manner. We also maintain profiles of users including demographical information and a summary of user activities within Yahoo! properties. Based on all features in user and content profiles, we develop predictive bilinear regression models to provide accurate personalized recommendations of new items for both existing and new users. This approach results in an offline model with light computational overhead compared with other recommender systems that require online re-training. The proposed framework is general and flexible for other personalized tasks. The superior performance of our approach is verified on a large-scale data set collected from the Today Module on Yahoo! Front Page, with comparison against six competitive approaches.

XML Extraction and Crawling

XML-1 Mounia Lalmas

Extracting Article Text from the Web with Maximum Subsequence Segmentation

869 Jeff Pasternack and Dan Roth

Abstract. Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are excellent (our semi-supervised algorithm yields an overall F1-score of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect block-selection heuristic. We also evaluate against the recent CleanEval shared task with surprisingly good cross-task performance cleaning general web pages, exceeding the top "text-only" score (based on Levenshtein distance), 87.8% versus 84.1%.

Extracting Data Records from the Web Using Tag Path Clustering

524 Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires and Louise Moser

Abstract. Fully automatic methods to extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy page structures due to a key limitation -- their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the document DOM tree. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous work.

Sitemaps: Above and Beyond the Crawl of Duty

616 Uri Schonfeld and Narayanan Shivakumar

Abstract. Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages' outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web, the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how "classic" discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.

End User Web Engineering

Engine-1 Martin Gaedke

Rapid Development of Spreadsheet-based Web Mashups

62 Woralak Kongdenfha, Boualem Benatallah, Julien Vayssière, Regis Saint-Paul and Fabio Casati

Abstract. The rapid growth of social networking sites and web communities have motivated web sites to expose their APIs to external developers who create mashups by assembling existing functionalities. Current APIs, however, aim toward developers with programming expertise; they are not directly usable by wider class of users who do not have programming background, but would nevertheless like to build their own mashups. To address this need, we propose a spreadsheet-based Web mashups development framework, which enables users to develop mashups in the popular spreadsheet environment. Our contributions are as follows. First, we provide a mechanism that makes structured data first class value of spreadsheet cells. Second, we propose a new component model that can be used to develop fairly sophisticated mashups, involving joining data sources, and keeping data presenting on spreadsheets up to date. Third, to simplify mashup development, we provide a collection of spreadsheet-based mashup patterns that capture common Web data access and spreadsheet presentation functionalities. Users can reuse and customize these patterns to build spreadsheet-based Web mashups instead of developing them from scratch. Fourth, we enable users to manipulate structured data presented on spreadsheet in a drag-and-drop fashion. Finally, we have developed and tested a proof-of-concept prototype to demonstrate the utility of the proposed framework.

Mashroom: End-User Mashup Programming Using Nested Tables Mashups

368 Guiling Wang, Shaohua Yang and Yanbo Han

Abstract. This paper presents an end-user-oriented programming environment called Mashroom. Major contributions herein include an end-user programming model with an expressive and graphical data structure as well as a set of formally-defined visual mashup operators. The data structure takes advantage of nested table, and maintains the intuitiveness while allowing users to express complex data objects. The mashup operators visualized with contextual menu and formula bar are directly applied on the data. Experiments and case studies reveal that end users have little difficulty in effectively and efficiently using Mashroom to build mashup applications.

Automated Construction of Web Accessibility Models from Transaction Click Streams

391 Jalal Mahmud, Yevgen Borodin, I.V. Ramakrishnan and C. R. Ramakrishnan

Abstract. Screen readers, the dominant assistive technology used by visually impaired users to access theWeb, function by speaking out the content of the screen serially. Using screen readers for conducting online transactions can cause considerable information overload, because transactions, such as shopping and paying bills, typically involve a number of steps spanning several web pages. One can combat this overload with a transaction model for web accessibility that presents only fragments of web pages that are needed for doing transactions. We can realize such a model by coupling a process automata, encoding states of a transaction, with concept classifiers that identify page fragments "relevant" to a particular state of the transaction. In this paper we present a fully automated process that synergistically combines several techniques for transforming click stream data generated by transactions into a transaction model. These techniques include web content analysis to partition a web page into segments consisting of semantically related content elements, contextual analysis of data surrounding clickable objects in a page, and machine learning methods, such as clustering of page segments based on contextual analysis, statistical classification, and automata learning. A unique aspect of the transformation process is that the click streams, that serve as the training data for the learning methods, need not be labeled. More generally, it operates with partially labeled click stream data where some or all the labels could be missing. Not having to rely exclusively on (manually) labeled click stream data has important benefits: (i) visually impaired users do not have to depend on sighted users for the training data needed to construct transaction models; (ii) it is possible to mine personalized models from transaction click streams associated with sites that visually impaired users visit regularly; (iii) since partially labeled data is relatively easy to obtain, it is feasible to scale up the construction of domain-specific transaction models (e.g.: separate models for shopping, airline reservations, bill payments, etc.); (iv) adjusting the performance of deployed models over time with new training data is also doable. We provide preliminary experimental evidence of the practical effectiveness of both domain-specific, as well as personalized accessibility transaction models built using our approach. Finally, this approach is applicable for building transaction models for mobile devices with limited-size displays, as well as for creating wrappers for information extraction from web sites.

Graph Algorithms

Mining-2 Jian Pei

Fast Dynamic Reranking in Large Graphs

791 Purnamrita Sarkar and Andrew Moore

Abstract. In this paper we consider the problem of re-ranking search results by incorporating user feedback. We present a graph theoretic measure for discriminating irrelevant results from relevant results using a few labeled examples provided by the user. The key intuition is that nodes relatively closer (in graph topology) to the relevant nodes than the irrelevant nodes are more likely to be relevant. We present a simple sampling algorithm to evaluate this measure at specific nodes of interest, and an efficient branch and bound algorithm to compute the top $k$ nodes from the entire graph under this measure. On quantifiable prediction tasks the introduced measure outperforms other diffusion-based proximity measures which take only the positive relevance feedback into account. On the entity-relation graph built from the authors and papers of the entire DBLP citation corpus ($1.4$ million nodes and $2.2$ million edges) our branch and bound algorithm takes about $1.5$ seconds to retrieve the top $10$ nodes w.r.t. this measure with $10$ labeled nodes.

Estimating the ImpressionRank of Web Pages

469 Ziv Bar-Yossef and Maxim Gurevich

Abstract. The ImpressionRank of a web page (or, more generally, of a web site) is the number of times users viewed the page while browsing search results. ImpressionRank captures the visibility of pages and sites in search engines and is thus an important measure, which is of interest to web site owners, competitors, market analysts, and end users.
All previous approaches to estimating the ImpressionRank of a page rely on privileged access to private data sources, like the search engine's query log. In this paper we present the first {\em external} algorithm for estimating the ImpressionRank of a web page. This algorithm relies on access to three public data sources: the search engine, the query suggestion service of the search engine, and the web. In addition, the algorithm is {\em local} and uses modest resources. It can therefore be used by almost any party to estimate the ImpressionRank of any page on any search engine. Empirical analysis of the algorithm on the Google and Yahoo!\ search engines indicates that it is accurate and provides interesting observations on sites and search queries.

Learning to Recognize Reliable Users and Content in Social Media with Coupled Mutual Reinforcement

334 Jiang Bian, Yandong Liu, Ding Zhou, Eugene Agichtein and Hongyuan Zha

Abstract. Community Question Answering (CQA) has emerged as a popular forum for users to pose questions for other users to answer. Over the last few years, CQA portals such as Naver and Yahoo! Answers have exploded in popularity, and now provide a viable alternative to general purpose web search. At the same time, the answers to past questions submitted in CQA sites comprise a valuable knowledge repository which could be a gold mine for information retrieval and automatic question answering. Unfortunately, the quality of the submitted questions and answers varies widely - increasingly so that a large fraction of the content is not usable for answering queries. Previous approaches for retrieving relevant and high quality content have been proposed, but they require large amounts of manually labeled data -- which limits the applicability of the supervised approaches to new sites and domains. In this paper we address this problem by developing a semi- supervised coupled mutual reinforcement framework for simultaneously calculating content quality and user reputation, that requires relatively few labeled examples to initialize the training process. Results of a large scale evaluation demonstrate that our methods are more effective than previous approaches for finding high-quality answers, questions, and users. More importantly, our quality estimation significantly improves the accuracy of search over CQA archives over the state-of- the-art methods.

Interactions in Social Communities

Social-3 Lise Getoor

Network Analysis of Collaboration Structure in Wikipedia

115 Ulrik Brandes, Patrick Kenis, Juergen Lerner and Denise van Raaij

Abstract. In this paper we give models and algorithms to describe and analyze the collaboration among authors of Wikipedia from a network analytical perspective. The edit-network encodes who interacts how with whom when editing an article; it significantly extends previous network-models that code author communities in Wikipedia. Several characteristics summarizing some aspects of the organization process and allowing the analyst to identify certain types of authors can be obtained from the edit-network. Moreover, we propose several indicators characterizing the global network structure and methods to visualize edit-networks. It is shown that the structural network indicators are correlated with quality labels of the associated Wikipedia articles.

The Slashdot Zoo: Mining a Social Network with Negative Edges

348 Jérôme Kunegis, Andreas Lommatzsch and Christian Bauckhage

Abstract. We analyse the corpus of user relationships of the Slashdot technology news site. The data was collected from the Slashdot Zoo feature where users of the website can tag other users as friends and foes, providing positive and negative endorsements. We adapt social network analysis techniques to the problem of negative edge weights. In particular, we consider signed variants of global network characteristics such as the clustering coefficient, node-level characteristics such as centrality and popularity measures, and link-level characteristics such as distances and similarity measures. We evaluate these measures on the task of identifying unpopular users, as well as on the task of predicting the sign of links and show that the network exhibits multiplicative transitivity which allow algebraic methods based on matrix multiplication to be used. We compare our methods to traditional methods which are only suitable for positively weighted edges.

Community Gravity: Measuring Bidirectional Effects by Trust and Rating on Online

193 Yutaka Matsuo and Hikaru Yamamoto

Abstract. Measuring the similarity between semantic relations that hold among entities is an important and necessary step in various Web related tasks such as relation extraction, information retrieval and analogy detection. For example, consider the case in which a person knows a pair of entities (e.g. \textit{Google, YouTube}), between which a particular relation holds (e.g. acquisition). The person is interested in retrieving other_such pairs with similar relations (e.g. \textit{Microsoft, Powerset}). Existing keyword-based search engines cannot be applied directly in this case because, in keyword-based search, the goal is to retrieve documents that are relevant to the words used in a query -- not necessarily to the relations implied by a pair of words. We propose a relational similarity measure, using a Web search engine, to compute the similarity between semantic relations implied by two pairs of words. Our method has three components: representing the various semantic relations that exist between a pair of words using automatically extracted lexical patterns, clustering the extracted lexical patterns to identify the different patterns that express a particular semantic relation, and measuring the similarity between semantic relations using a metric learning approach. We evaluate the proposed method in two tasks: classifying semantic relations between named entities, and solving word-analogy questions. The proposed method outperforms all baselines in a relation classification task with a statistically significant average precision score of $0.74$. Moreover, it reduces the time take by Latent Relational Analysis to process $374$ word-analogy questions from $9$ days to less than $6$ hours, with a SAT score of $51\%$.

XML Querying

XML-2 Mounia Lalmas

Performing grouping and aggregate functions in XML queries

107 Huayu Wu, TokWang Ling, Liang Xu and Zhifeng Bao

Abstract. Since more and more business data are represented in XML format, there is a compelling need of supporting analytical operations in XML queries. Particularly, the latest version of XQuery proposed by W3C, XQuery 1.1, introduces a new construct to explicitly express grouping operation in FLWOR expression. Existing works in XML query processing mainly focus on physically matching query structure over XML document. Given the explicit grouping operation in a query, how to efficiently compute grouping and aggregate functions over XML document is not well studied yet. In this paper, we extend our previous XML query processing algorithm, VERT, to efficiently perform grouping and aggregate function in queries. The main technique of our approach is introducing relational tables to index values. Query pattern matching and aggregation computing are both conducted with table indexes. We also propose two semantic optimizations to further improve the query performance. Finally we present experimental results to validate the efficiency of our approach, over other existing approaches.

XQuery in the Browser

427 Ghislain Fourny, Markus Pilman, Daniela Florescu, Donald Kossmann, Tim Kraska and Darin McBeath

Abstract. Since the invention of the Web, the browser has become more and more powerful. By now, it is a programming and execution environment in itself. The predominant language to program applications in the browser today is JavaScript. With browsers becoming more powerful, JavaScript has been extended and new layers have been added (e.g., DOM-Support and XPath). Today, JavaScript is very successful and applications and GUI features implemented in the browser have become increasingly complex. The purpose of this paper is to improve the programmability of Web browsers by enabling the execution of XQuery programs in the browser. Although it has the potential to ideally replace JavaScript, it is possible to run it in addition to JavaScript for more flexibility. Furthermore, it allows instant code migration from the server to the client and vice-versa. This enables a significant simplification of the technology stack. The intuition is that programming the browser involves mostly XML (i.e., DOM) navigation and manipulation, and the XQuery family of W3C standards were designed exactly for that purpose. The paper proposes extensions to XQuery for Web browsers and gives a number of examples that demonstrate the usefulness of XQuery for the development of AJAX-style applications. Furthermore, the paper presents the design of an XQuery plug-in for Microsoft's Internet Explorer. The paper also gives examples of applications which were developed with the help of this plug-in.

Answering Approximate Queries over Autonomous Web Databases

259 Xiangfu Meng, Z. M. Ma and Li Yan

Abstract. To deal with the problem of empty or too little answers returned from a Web database in response to a user query, this paper proposes a novel approach to provide relevant and ranked query results. Based on the user original query, we speculate how much the user cares about each specified attribute and assign a corresponding weight to it. This original query is then rewritten as an approximate query by relaxing the query criteria range. The relaxation order of all specified attributes and the relaxed degree on each specified attribute are varied with the attribute weights. For the approximate query results, we generate users' contextual preferences from database workload and use them to create a priori orders of tuples in an off-line preprocessing step. Only a few representative orders are saved, each corresponding to a set of contexts. Then, these orders and associated contexts are used at query time to expeditiously provide ranked answers. Results of a preliminary user study demonstrate that our query relaxation and results ranking methods can capture the user's preferences effectively. The efficiency and effectiveness of our approach is also demonstrated by experimental result.

Service Oriented Development

Engine-2 Gerti

Combining Global Optimization with Local Selection for Efficient QoS-aware Service Composition

335 Mohammad Alrifai and Thomas Risse

Abstract. The run-time binding of web services has been recently put forward in order to support rapid and dynamic web service compositions. With the growing number of alternative web services that provide the same functionality but differ in quality parameters, the service composition becomes a decision problem on which component services should be selected such that user's end-to-end QoS requirements (e.g. availability, response time) and preferences (e.g. price) are satisfied. Although very efficient, local selection strategy fails short in handling global QoS requirements. Solutions based on global optimization, on the other hand, can handle global constraints, but their poor performance renders them inappropriate for applications with dynamic and real-time requirements. In this paper we address this problem and propose a solution that combines global optimization with local selection techniques to benefit from the advantages of both worlds. The proposed solution consists of two steps: first, we use mixed integer programming (MIP) to find the optimal decomposition of global QoS constraints into local constraints. Second, we use distributed local search to find the best web services that satisfy these local constraints. The results of experimental evaluation indicate that our approach significantly outperforms existing solutions in terms of computation time while achieving close-to-optimal results.

A Trust Management Framework for Service-Oriented Environments

509 William Conner, Arun Iyengar, Thomas Mikalsen, Isabelle Rouvellou and Klara Nahrstedt

Abstract. Many reputation management systems have been developed under the assumption that each entity in the system will use a variant of the same scoring function. Much of the previous work in reputation management has focused on providing robustness and improving performance for a given reputation scheme. In this paper, we present a reputation-based trust management framework that supports the synthesis of trust-related feedback from many different entities while also providing each entity with the flexibility to apply different scoring functions over the same feedback data for customized trust evaluations. We also propose a novel scheme to cache trust values based on recent client activity. To evaluate our approach, we implemented our trust management service and tested it on a realistic application scenario in both LAN and WAN distributed environments. Our results indicate that our trust management service can effectively support multiple scoring functions with low overhead and high availability.

Test Case Prioritization in Regression Testing of Service-Oriented Business Applications

794 Lijun Mei, Zhenyu Zhang, W.K. Chan and T.H. Tse

Abstract. Regression testing assures the quality of modified service-oriented business applications against unintended changes. However, a typical regression test suite is large in size. Executing those test cases that may detect failures earlier is attractive. Many existing prioritization techniques arrange test cases based on their individual code coverage on program statements of a previous version of the application. On the other hand, industrial service-oriented business applications are typically written in orchestration languages such as WS-BPEL and integrate workflow steps and web services via XPath and WSDL. Faults in these artifacts may cause the application to extract wrong data from messages and lead to failures in executing service compositions. Surprisingly, little existing regression testing research considers these artifacts. We propose a multilevel coverage model to capture business process, XPath and WSDL from the regression testing perspective. Atop the model, we develop a family of test case prioritization techniques. The empirical results show that our techniques can achieve significantly higher rates of fault detection than existing techniques.

Text Mining

Mining-3 Ravi Kumar

Detecting The Origin Of Text Segments Efficiently

419 Ossama Abdelhamid, Behshad Behzadi, Stefan Christoph and Monika Henzinger

Abstract. In the origin detection problem an algorithm is given a set $S$ of documents, ordered by creation time, and a query document $D$. It needs to output for every consecutive sequence of $k$ alphanumeric terms in $D$ the earliest document in $S$ in which the sequence appeared (if such a document exists.). Algorithms for the origin detection problem can, for example, be used to detect the ``origin'' of text segments in $D$ and thus to detect novel content in $D$. They can also find the document from which the author of $D$ has copied the most (or show that $D$ is mostly original.). We concentrate on solutions that use only a fixed amount of memory. We propose novel algorithms for this problem and evaluate them together with a large number of previously published algorithms. Our results show that (1) detecting the origin of text segments efficiently can be done with very high accuracy even when the space used is less than 1$\%$ of the size of the documents in $S$, (2) the precision degrades smoothly with the amount of available space, (3) various estimation techniques can be used to increase the performance of the algorithms.

Summarization through Structure Learning with Diversity, Coverage and Balance

739 Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha and Yong Yu

Abstract. Document summarization has played an ever more important role with the exponential growth of documents on the web. Many supervised and unsupervised approaches have been proposed to extract summaries from documents. However, these approaches seldom consider summary diversity, coverage, and balance issues which to a large extent determine the quality of summaries. In this paper we consider extract-based summarization with emphasis placed on three requirements: 1) diversity in summarization which seeks to reduce redundancy among sentences in the summary; 2) sufficient coverage which focuses on avoiding loss of key information of the document in the summary; and 3) balance which demands a equal amount of information about different aspects of a document in the summary. We formulate the extract-based summarization problem as learning a mapping from a set of sentences of a given document to a subset of the sentences that satisfies the above three requirements. The mapping is learned by incorporating several constraints in a structured learning framework to enhance diversity, coverage and balance of the summaries. Furthermore, we explore a graph structure of the output variable in the structure learning problem and employ structured SVM for its solution. Experiments on the DUC2001 data sets demonstrate significant improvement of performance in terms of the F1 and ROUGE metrics.

Efficient Overlap and Content Reuse Detection in Blogs and Online News Articles

622 Jong Wook Kim, K. Selcuk Candan and Junichi Tatemura

Abstract. The use of blogs to track and comment on real world (political, news, entertainment) events is growing. Similarly, as more individuals start relying on the Web as their primary information source and as more traditional media outlets try reaching consumers through alternative venues, the number of news sites on the Web is also continuously increasing. Content-reuse, whether in the form of extensive quotations or content borrowing across media outlets, is very common in blogs and news entries outlets tracking the same real-world event. Knowledge about which web entries re-use content from which others can be an effective asset when organizing these entries for presentation. On the other hand, this knowledge is not cheap to acquire: considering the size of the related space web entries, it is essential that the techniques developed for identifying re-use are fast and scalable. Furthermore, the dynamic nature of blog and news entries necessitates incremental processing for reuse detection. In this paper, we develop a novel qSign algorithm that effciently and effectively analyze the blogosphere for quotation and reuse identification. Experiment results show that with qSign processing time gains from 10X to 100X are possible while maintaining reuse detection rates of upto 90%. Furthermore, processing time gains can be pushed multiple orders of magnitude (from 100X to 1000X) for 70% recall.

Diffusion and Search in Social Networks

Social-2 Ronny Lempel

Social Search in "Small World" Experiments

45 Sharad Goel, Roby Muhamad and Duncan Watts

Abstract. The "algorithmic small-world hypothesis" states that not only are pairs of individuals in a large social network connected by short paths, but that ordinary individuals can find these paths. Although theoretically plausible, empirical evidence for the hypothesis is limited, as most chains in "small-world" experiments fail to complete, thereby biasing estimates of "true" chain lengths. Using data from two recent small-world experiments, comprising a total of 162,328 message chains, and directed at one of 30 "targets" spread across 19 countries, we model heterogeneity in chain attrition rates as a function of individual attributes. We then introduce a rigorous way of estimating true chain lengths that is provably unbiased, and can account for empirically-observed variation in attrition rates. Our findings provide mixed support for the algorithmic hypothesis. On the one hand, it appears that roughly half of all chains can be completed in 6-7 steps--thus supporting the "six degrees of separation" assertion--but on the other hand, estimates of the mean are much longer, suggesting that for at least some of the population, the world is not "small" in the algorithmic sense. We conclude that search distances in social networks are fundamentally different from topological distances, for which the mean and median of the shortest path lengths between nodes tend to be similar.

Behavioral Profiles for Advanced Email Features

691 Thomas Karagiannis and Milan Vojnovic

Abstract. We examine the behavioral patterns of email usage in a large-scale enterprise over a three-month period. In particular, we focus on two main questions: (Q1) what do replies depend on? and (Q2) what is the gain of augmenting contacts through the friends of friends from the email social graph? For Q1, we identify and evaluate the significance of several factors that affect the reply probability and the email response time. We find that all factors of our considered set are significant, provide their relative ordering, and identify the recipient list size, and the intensity of email communication between the correspondents as the dominant factors. We highlight various novel threshold behaviors and provide support for existing hypotheses such as that of the least-effort reply. For Q2, we find that the number of new contacts extracted from the friends-of-friends relationships amounts to a large number, but which is still a limited portion of the total enterprise size. We believe that our results provide significant insights towards informed design of advanced email features, including those of social-networking type.

A Measurement-driven Analysis of Information Propagation in the Flickr Social Network

753 Meeyoung Cha, Alan Mislove and Krishna Gummadi

Abstract. Online social networking sites like MySpace, Facebook and Flickr have become a popular way to share and disseminate content. Their massive popularity has led to viral marketing techniques that attempt to spread content, products, and political campaigns on these sites. However, there is little data publicly available on viral propagation in the real world and few studies have attempted to characterize how information spreads over current online social networks. In this paper, we collect and analyze large-scale traces of information dissemination in the Flickr social network. Our analysis, based on repeated crawls of the favorite markings of 11 million photos and the social network of 2.5 million users, attempts to answer three key questions: (a) how widely does information propagates in the social network? (b) how quickly does information propagate? and (c) what is the role of word-of-mouth exchanges between friends in the overall propagation of information in the network? Somewhat counter intuitively, we find that (a) even popular information does not spread widely throughout the network, (b) even popular photos spread very slowly through the network, and (c) information exchanged between friends is likely to account for over 50\% of all favorite bookmarks. Our findings stand in sharp contradiction to our initial expectation that information would spread widely and quickly in a viral fashion across the social network.

Web Security

Privacy-2 Gail-Joon Ahn

All Your Contacts Are Belong to Us: Automated Identity Theft

578 Leyla Bilge, Thorsten Strufe, Davide Balzarotti and Engin Kirda

Abstract. Social networking sites have been increasingly gaining popularity. Well- known sites such as Facebook have been reporting growth rates as high as 3% per week. Many social networking sites have millions of registered users who use these sites to share photographs, contact long-lost friends, establish new business contacts and to keep in touch. In this paper, we investigate how easy it would be for a potential attacker to launch automated crawling and identity theft (i.e., cloning) attacks against a number of popular social networking sites in order to gain access to a large volume of personal user information. The simplest attack we present consists of the automated identity theft of existing user profiles and the sending of friendship requests to the contacts of the cloned victim. The hope, from the attacker's point of view, is that the contacted users simply trust and accept the friendship request. By establishing a friendship relationship with the contacts of a victim, the attacker is able to access the sensitive personal information provided by them. In the second, more advanced attack we present, we show that it is effective and feasible to launch an automated, cross-site profile cloning attack. In this attack, we are able to automatically create a forged profile in a network where the victim is not registered yet and contact the victim's friends who are registered on both networks. Our experimental results with real users show that the automated attacks we present are effective and feasible in practice.

Using Static Analysis for Ajax Intrusion Detection

598 Arjun Guha, Shriram Krishnamurthi and Trevor Jim

Abstract. We present a static control-flow analysis for JavaScript programs running in a web browser. Our analysis tackles numerous challenges posed by modern web applications, including asynchronous communication, frameworks, and dynamic code generation. We use our analysis to extract a model of expected client behavior as seen from the server, and build an intrusion-prevention proxy for the server: the proxy intercepts client requests and disables those that do not meet the expected behavior. We insert random asynchronous requests to foil mimicry attacks. Finally, we evaluate our technique against several real applications and show that it protects against an attack in a widely-used web application.

A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval

745 Guang Xiang and Jason Hong

Abstract. Phishing is a significant security threat to the Internet, which causes tremendous economic loss every year. In this paper, we proposed a novel hybrid phish detection method based on state-of-the-art information extraction (IE) and information retrieval (IR) techniques. The identity-based component of our method detects phishing webpages by directly discovering the inconsistency between their identity and the identity they are imitating. The keywords-retrieval component utilizes IR algorithms exploiting the power of search engines to identify phish. Our method requires no training data, no prior knowledge of phishing signatures and specific implementations, and thus is able to adapt quickly to constantly appearing new phishing patterns. Comprehensive experiments over a diverse spectrum of data sources with 11449 pages show that both components have a low false positive rate and the stacked approach achieves a true positive rate of 90.06% with a false positive rate of 1.95%.

Web Architecture Aspect

Engine-3 Bebo White

Why is the Web Loosely Coupled? A Multi-Faceted Metric for Service Design

149 Cesare Pautasso and Erik Wilde

Abstract. Loose coupling is often quoted as a desirable property of systems architectures. One of the main goals of building systems using Web technologies is to achieve loose coupling. However, given the lack of a widely accepted definition of this term, it becomes hard to use coupling as a criterion to evaluate alternative Web technology choices, as all options may exhibit, and claim to provide, some kind of "loose" coupling effects. This paper presents a systematic study of the degree of coupling found in service-oriented systems based on a multi-faceted approach. Thanks to the metric introduced in this paper, coupling is no longer a one-dimensional concept with loose coupling found somewhere in between tight coupling and no coupling. The paper shows how the metric can be applied to real- world examples in order to support and improve the design process of service- oriented systems.

Highly Scalable Web Applications with Zero-Copy Data Transfer

505 Toyotaro Suzumura, Michiaki Tatsubori, Scott Trent, Akihiko Tozawa and Tamiya Onodera

Abstract. The performance of server side applications is becoming increasingly important as more applications exploit the web application model. Extensive work has been done to improve the performance of individual software components such as web servers and programming language runtimes. This paper describes a novel approach to boost web application performance by improving interprocess communication between the programming language runtime and operating system. The approach reduces redundant processing for memory copying and the context switch overhead between users space and kernel space by exploiting the zero-copy data transfer methodology such as the sendfile system call. In order to transparently utilize this optimization feature with existing web applications, we propose an enhancement of the PHP runtime, FastCGI protocol, and web server. Our proposed approach achieves a 126% performance improvement with micro-benchmarks, and a 22% performance improvement for the standard web benchmark, SPECweb2005.

REST-Based Management of Loosely Coupled Services

664 Heiko Ludwig, Jim Laredo, Kamal Bhattacharya, Liliana Pasquale and Bruno Wassermann

Abstract. Applications increasingly make use of the distributed platform that the World Wide Web provides Ð be it as a Software-as-a-Service such as, an application infrastructure such as, or a computing infrastructure such as a "cloud". A common characteristic of applications of this kind is that they are deployed on infrastructure or make use of components that reside in different management domains. Current service management approaches and systems, however, often rely on a centrally managed configuration management database (CMDB), which is the basis for centrally orchestrated service management processes, in particular change management and incident management. The distribution of management responsibility of WWW based applications requires a decentralized approach to service management. This paper proposes an approach of decentralized service management based on distributed configuration management and service process co-ordination, making use RESTful access to configuration information and ATOM-based distribution of updates as a novel foundation for service management processes.

Statistical Methods

Mining-4 Sergei Vassilvitskii

Latent Space Domain Transfer between High Dimensional Overlapping Distributions

230 Sihong Xie, Wei Fan, Jing Peng, Olivier Verscheure and Jiangtao Ren

Abstract. Transferring knowledge from one domain to another is challenging due to a number of reasons. Since both conditional and marginal distribution of the training data and test data are non-identical, model trained in one domain, when directly applied to a diffeerent domain, is usually low in accuracy. For many applications with large feature sets, such as text document, sequence data, medical data, image data of different resolutions, etc. two domains usually do not contain exactly the same features, thus introducing large numbers of "missing values" when considered over the union of features from both domains. In other words, its marginal distributions are at most overlapping. In the same time, these problems are usually high dimensional, such as, several thousands of features. Thus, the combination of high dimensionality and missing values make the relationship in conditional probabilities between two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of two domains closer by "filling up" those missing values of disjoint features. Afterwards, it looks for those comparable sub-structures in the "latent-space" as mapped from the expanded feature vector, where both marginal and conditional distribution are similar. With these sub-structures in latent space, the proposed approach then find common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as "queries", the mostly related labeled instances from out-domain are retrieved, and the classifcation is made by weighted voting using retrievd out-domain examples. We formally show that importing feature values across domains and latent semantic index can jointly make the distributions of two related domains easier to measure than in original feature space, the nearest neighbor method employed to retrieve related out domain examples is bounded in error when predicting in-domain examples.

StatSnowball: a Statistical Approach to Extracting Entity Relationships

236 Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang and Ji-Rong Wen

Abstract. Traditional relation extraction methods require pre-specified relations and relation-specific human-tagged examples. Bootstrapping systems significantly reduce the number of training examples, but they usually apply heuristic-based methods to combine a set of strict hard rules, which will cause limited generalizability and thus low recall. Furthermore, existing bootstrapping methods cannot perform open information extraction (Open IE), which can identify various types of relations without requiring pre-specifications. In this paper, we propose a statistical extraction framework called {\it Statistical Snowball} (StatSnowball), which is a bootstrapping system and can perform both traditional relation extraction and Open IE.
StatSnowball uses the discriminative Markov logic networks (MLNs) and softens hard rules by learning their weights in a maximum likelihood estimate sense. MLN is a general model, and can be configured to perform different levels of relation extraction. In StatSnwoball, pattern selection is performed by solving an $\ell_1$- norm penalized maximum likelihood estimation, which enjoys well-founded theories and efficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small but fully labeled data set and large-scale Web data. Empirical results show that StatSnowball can achieve a significantly higher recall without sacrificing the high precision during iterations with a small number of seeds; and the joint inference of MLN can improve the performance. Finally, StatSnowball is efficient and we have developed a working entity relation search engine called {\it Renlifang} based on it.

Large Scale Online Bayesian Recommendations

192 David Stern, Ralf Herbrich and Thore Graepel

Abstract. We present a probabilistic model for generating personalised recommendations of items to users of a web service. The system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional `trait space' in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/ don't like) and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an items popularity, a user's taste or a user's personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of ~1,000,000 and ~100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.

Search UI

Search-1 Ziv Bar-Yossef

Efficient Interactive Fuzzy Keyword Search

729 Shengyue Ji, Guoliang Li, Chen Li and Jianhua Feng

Abstract. Traditional information systems return answers after a user submits a complete query. Users often feel ``left in the dark'' when they have limited knowledge about the underlying data, and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step towards solving this problem. In this paper, we study a new information-access paradigm, called ``interactive fuzzy search,'' in which the system searches the underlying data ``on the fly'' as the user types in query keywords. It extends autocomplete interfaces by (1) allowing keywords to appear in multiple attributes (in an arbitrary order) of the underlying data; and (2) finding relevant records that have keywords matching query keywords {\em approximately}. This framework allows users to explore data as they type, even in the presence of minor errors. We study research challenges in this framework for large amounts of data. Since each keystroke of the user could invoke a query on the backend, we need efficient algorithms to process each query within milliseconds. We develop various incremental-search algorithms using previously computed and cached results in order to achieve an interactive speed. We have deployed several real prototypes using these techniques. One of them has been deployed to support interactive search on the UC Irvine people directory, which has been used regularly and well received by users due to its friendly interface and high efficiency.

An Axiomatic Approach to Result Diversification

430 Sreenivas Gollapudi and Aneesh Sharma

Abstract. Understanding user intent is key to designing an effective ranking system in a search engine. In the absence of any explicit knowledge of user intent, search engines want to diversify results to improve user satisfaction. In such a setting, the probability ranking principle-based approach of presenting the most relevant results on top can be sub-optimal, and hence the search engine would like to trade-off relevance for diversity in the results.
In an analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem that results in algorithms for a number of diversification objectives. Finally, we propose an evalutation methodology to characterize the objectives and the underlying axioms. We conduct a large scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database.

Quicklink Selection for Navigational Query Results

806 Deepayan Chakrabarti, Ravi Kumar and Kunal Punera

Abstract. Web search engines leverage information from structured databases to answer queries. For example, many product related queries on search engines ({\em Amazon, Google, Yahoo, Live Search}) are answered by searching underlying product databases containing names, descriptions, specifications, and reviews of products. However, these vertical search engines are ``{\em silo-ed}'' in that their results are independent of those from web search. This often leads to empty or incomplete results, as query terms are matched against the information in the underlying database only. In order to overcome this problem, we propose an approach that first identifies relationships between web documents and items in structured databases. This allows us to subsequently exploit results from web search engines in combination with these relationships to obtain the structured data items relevant for a much wider range of queries. We propose an architecture that implements the integrated search functionality efficiently, adding very little additional overhead to query processing and is fully integrated with the search engine architecture. We demonstrate the quality of our techniques through an extensive experimental study.

Photos and Web 2.0

Social-4 Shilad Sen

Mapping the World's Photos

597 David Crandall, Lars Backstrom, Daniel Huttenlocher and Jon Kleinberg

Abstract. We investigate how to organize a large collection of geotagged photos, working with a dataset of about 20 million images collected from Flickr. Our approach combines content analysis based on text tags and image data with structural analysis based on geospatial data. We use the spatial distribution of where people take photos to define a relational structure between the photos that are taken at popular places. We then study the interplay between this structure and the content, using classification methods for predicting such locations from visual, textual and temporal features of the photos. We find that visual and temporal features improve the ability to estimate the location of a photo, compared to using just textual features. We illustrate using these techniques to organize a large photo collection, while also revealing various interesting properties about popular cities and landmarks at a global scale.

Ranking and Classifying Attractiveness of Photos in Folksonomies

584 Jose San Pedro and Stefan Siersdorfer

Abstract. Web 2.0 applications like Flickr, YouTube, or are increasingly popular online communities for creating, editing and sharing content and the growing size of these folksonomies poses new challenges in terms of search and mining. In this paper we introduce a novel methodology for automatically ranking and classifying photos according to their attractiveness for folksonomy members. To this end, we exploit image features known for having significant effects on the visual quality perceived by humans (e.g. sharpness and colorfulness) as well as textual meta data, in what is a multi-modal approach. Using feedback and annotations available in the Web 2.0 photo sharing system Flickr, we assign relevance values to the photos and train classification and regression models based on these relevance assignments. With the resulting machine learning models we categorize and rank photos according to their attractiveness. Applications include enhanced ranking functions for search and recommender methods for attractive content. Large scale experiments on a collection of Flickr photos demonstrate the viability of our approach.

Constructing Folksonomies from User-specified Relations on Flickr

500 Anon Plangprasopchok and Kristina Lerman

Abstract. Automatic folksonomy construction from tags has attracted much attention recently. However, inferring hierarchical relations between concepts from tags has a drawback in that it is difficult to distinguish between more popular and more general concepts. Instead of tags we propose to use user-specified relations for learning folksonomy. We explore two statistical frameworks for aggregating many shallow individual hierarchies, expressed through the collection/set relations on the social photosharing site Flickr, into a common deeper folksonomy that reflects how a community organizes knowledge. Our approach addresses a number of challenges that arise while aggregating information from diverse users, namely noisy vocabulary, and variations in the granularity level of the concepts expressed. Our second contribution is a method for automatically evaluating learned folksonomy by comparing it to a reference taxonomy, e.g., the Web directory created by the Open Directory Project. Our empirical results suggest that user- specified relations are a good source of evidence for learning folksonomies.

Client Side Web Engineering

Engine-4 Daniel Schwabe

Co-browsing dynamic web pages

420 Dietwig Lowet and Daniel Goergen

Abstract. Collaborative browsing, or co-browsing, is the co-navigation of the web with other people at-a-distance, supported by software that takes care of synchronizing the browsers. Current state-of-the-art solutions are able to do co- browsing of "static web pages", and do not support the synchronization of JavaScript interactions. Currently many web pages use JavaScript and Ajax techniques to create highly dynamic web applications. In this paper, we describe two approaches for co-browsing that both support the synchronization of the JavaScript and Ajax interactions of dynamic web pages. One approach is based on synchronizing the output of the JavaScript engine by sending over the changes made on the DOM tree. The other approach is based on synchronizing the input of the JavaScript engine by synchronizing UI events and incoming data. Since the latter solution offers a better user experience and is more scalable, it is elaborated in more detail. An important aspect of both approaches is that they operate at the DOM level. Therefore, the client-side can be implemented in JavaScript and no browser extensions are required. To the best of the authors' knowledge this is the first DOM-level co-browsing solution that also enables co-browsing of the dynamic interaction parts of web pages. The presented co-browsing solution has been implemented in a research demonstrator which allows users to do co-browsing of web-applications on browser-based networked televisions

HTML Templates that Fly - A Template Engine Approach to Automated Offloading from Server to Client

461 Michiaki Tatsubori and Toyotaro Suzumura

Abstract. Web applications often use HTML template engines to separate the webpage presentation from its underlying business logic and objects. This is now the de facto standard programming model for Web application development. This paper proposes a novel implementation for existing server-side template engines, MoveableTemplate, for (a) reduced bandwidth consumption in Web application servers, and (b) off-loading HTML generation tasks to Web clients. Instead of producing a fully-generated HTML page, the proposed template engine produces a skeletal script which includes only the dynamic values of the template parameters and the bootstrap code that runs on a Web browser at client side. It retrieves a client-side template engine and the payload templates separately. With the goals of efficiency, implementation transparency, security, and standard compliance in mind, we developed MoveableTemplate with two design principles: effective browser cache usage, and reasonable compromises which restrict template usage patterns and relax security policies slightly but explicitly. This approach allows most typical template-based Web applications to run effectively with MoveableTemplate. As an experiment, we tested the SPECweb2005 Banking application using MoveableTemplate without any other modifications and saw throughput improvements from 1.5x to 2.0x at its best mode. Moreover, MoveableTemplate makes applications comply with a specified security policy while conventional technologies for automatic client-server partitioning can pose security problems in the Web environment.

Characterizing Insecure JavaScript Practices on the Web

638 Chuan Yue and Haining Wang

Abstract. JavaScript is an interpreted programming language most often used for enhancing web page interactivity and functionality. It has powerful capabilities to interact with web page documents and browser windows, however, it has also opened the door for many browser-based security attacks. Insecure engineering practices of using JavaScript may not directly lead to security breaches, but can create new attack vectors and greatly increase the risks of browser-based attacks. In this paper, we present the first measurement study on insecure practices of using JavaScript on the web. Our focus is on the insecure practices of JavaScript inclusion and dynamic generation, and we examine their severity and nature on 6,805 unique web sites. Our measurement results reveal that insecure JavaScript practices are common at various web sites: (1) at least 66.4% of the measured web sites manifest the insecure practices of including JavaScript files from external domains into the top-level documents of their web pages; (2) over 44.4% of the measured web sites use the dangerous eval() function to dynamically generate and execute JavaScript code on their web pages; and (3) in JavaScript dynamic generation, using the document.write() method and innerHTML property is much more popular than using the relatively secure technique of creating script elements via DOM methods. Our analysis indicates that safe alternatives to these insecure practices exist in common cases and ought to be adopted by web site developers and administrators for reducing potential security risks.


Mining-5 Andrew Tomkins

Learning Consensus Opinion:Mining Data from a Labeling Game

556 Paul Bennett, Max Chickering and Anton Mityagin

Abstract. In this paper, we consider the challenge of how to identify the consensus opinion of a set of users as to how the results for a query should be ranked. Once consensus rankings are identified for a set of queries, these rankings can serve for both evaluation and training of retrieval and learning systems. We present a novel approach to collecting user preferences over image-search results: we use a collaborative game in which players are rewarded for agreeing on which image result is best for a query. Our approach is distinct from other labeling games because we are able to elicit directly the preferences of interest with respect to image queries extracted from query logs. As a source of relevance judgments, this data provides a useful complement to click data. Furthermore, it is free of positional biases and does not carry the risk of frustrating users with non-relevant results associated with proposed mechanisms for debiasing clicks. We describe data collected over 35 days from a deployed version of this game that amounts to about 19 million expressed preferences between pairs. Finally, we present several approaches to modeling this data in order to extract the consensus rankings from the preferences and better sort the search results for targeted queries.

Rated Aspect Summarization of Short Comments

448 Yue Lu, ChengXiang Zhai and Neel Sundaresan

Abstract. Web 2.0 technologies have enabled more and more people to freely comment on different kinds of entities (e.g. sellers, products, services). The large scale of information poses the need and challenge of automatic summarization. In many cases, each of the user generated short comments comes with an overall rating. In this paper, we study the problem of generating a ``rated aspect summarization'' of short comments, which is a decomposed view of the overall ratings for the major aspects so that a user could gain different perspectives towards the target entity. We formally define the problem and decompose the solution into three steps. We demonstrate the effectiveness of our methods by using eBay sellers' feedback comments. We also quantitatively evaluate each step of our methods and study how human agree on such summarization task. The proposed methods are quite general and can be used to generate rated aspect summary given any collection of short comments each associated with an overall rating.

How opinions are received by online communities: A case study on helpfulness votes

741 Cristian Danescu Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg and Lillian Lee

Abstract. There are many on-line settings in which users publicly express opinions. A number of these offer mechanisms for other users to evaluate these opinions; a canonical example is, where reviews come with annotations like ``26 of 32 people found the following review helpful.'' Opinion evaluation appears in many off-line settings as well, including market research and political campaigns. Reasoning about the evaluation of an opinion is fundamentally different from reasoning about the opinion itself: rather than asking, ``What did Y think of X?'', we are asking ``What did Z think of Y's opinion of X?''
Here we develop a framework for analyzing and modeling opinion evaluation, using a large-scale collection of Amazon book reviews as a dataset. We find that the perceived helpfulness of a review is based not just on its content but also depends in subtle ways on its score relative to other scores for the same product. As part of our approach, we develop novel methods to control for the effects of text in opinion evaluation, and we provide a simple, natural mathematical model consistent with our findings. Our analysis also allows us to distinguish among the predictions of competing theories from sociology and social psychology, and to discover unexpected differences in the collective opinion evaluation behavior of user populations from different countries.

Query Processing

Search-2 Ricardo Baeza-Yates

Inverted Index Compression and Query Processing with Optimized Document Ordering

670 Hao yan, Shuai Ding and Torsten Suel

Abstract. Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies [] first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering.
We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of $25.2$ million web pages.

RuralCafe: Web Search in the Rural Developing World

722 Jay Chen, Lakshminarayanan Subramanian and Jinyang Li

Abstract. A majority of the world's population in rural developing regions do not have access to the World Wide Web. Traditional network connectivity technologies have proven to be prohibitively expensive in these areas. The emergence of new long-range wireless technologies provide hope for connecting these rural regions to the Internet. However, the network connectivity provided by these new solutions are by nature {\em intermittent} due to high network usage rates, frequent power-cuts and the use of delay tolerant links.
Typical applications, especially interactive applications such as web search, cannot work in the face of intermittent connectivity. In this paper, we present the design and implementation of {\em RuralCafe}, a system intended to support efficient web search over intermittent networks. RuralCafe enables users to perform web search asynchronously and find what they are looking for in {\em one round of intermittency} as opposed to multiple rounds of search/downloads. RuralCafe does this by providing an expanded search query interface which allows a user to specify additional query terms to maximize the utility of the results returned by a search query. Given knowledge of the limited available network resources, RuralCafe performs optimizations to prefetch pages to best satisfy a search query based on a user's search preferences. In addition, RuralCafe does not require modifications to the web browser, and can provide single round search results tailored to various types of networks and economic constraints. We have implemented and evaluated the effectiveness of RuralCafe using queries from logs made to a large search engine, queries made by users in an intermittent setting, and live queries from a small testbed deployment.

Using Graphics Processors for High Performance IR Query Processing

699 Shuai Ding, Jinru He, Hao Yan and Torsten Suel

Abstract. Web search engines are facing formidable performance challenges due to data sizes and query loads. The major engines have to process tens of thousands of queries per second over tens of billions of documents. To deal with this heavy workload, such engines employ massively parallel systems consisting of thousands of machines. The significant cost of operating these systems has motivated a lot of recent research into more efficient query processing mechanisms. We investigate a new way to build such high performance IR systems using graphical processing units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently a number of researchers have studied how to use GPUs for other problem domains such as databases and scientific computing. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, to develop suitable algorithms for subtasks such as inverted list compression, list intersection, and top-k scoring, and to show how to achieve highly efficient query processing on GPU-based systems. Our experimental results for a prototype GPU-based system on 25:2 million web pages indicate that significant gains in query processing performance can be obtained over a state-of-the-art CPU-based system.

Media Applications

RichMedia-1 Fausto Giunchiglia

Less Talk, More Rock: Automated Organization of Community-Contributed Collections of Concert Videos

330 Lyndon Kennedy and Mor Naaman.

Abstract. We describe a system for synchronization and organization of user-contributed content from live music events. We start with a set of short video clips taken at a single event by multiple contributors, who were using a varied set of cap- ture devices. Using audio ngerprints, we synchronize these clips such that overlapping clips can be displayed simulta- neously. Furthermore, we use the timing and link structure generated by the synchronization algorithm to improve the ndability and representation of the event content, including identifying key moments of interest and descriptive text for important captured segments of the show. We also identify the preferred audio track when multiple clips overlap. We thus create a much improved representation of the event that builds on the automatic content match. Our work demon- strates important principles in the use of content analysis techniques for social media content on the Web, and applies

A Generalised Cross-Modal Clustering Method Applied to Multimedia News Semantic Indexing and Retrieval

111 Alberto Messina and Maurizio Montagnuolo

Abstract. The evolution of Web services has enabled the distribution of informative content through dynamic media such as RSS feeds. In addition, the availability of the same informative content in the form of digital multimedia data has dramatically increased. Robust solutions for semantic aggregation of heterogeneous data streams are needed to efficiently access desired information from this variety of information sources. To this end, we present a novel approach for cross-media information sources aggregation, and describe a prototype system implementing this approach. The prototype adopts online newspaper articles and TV newscasts as information sources, to deliver a service made up of items including both contributions. Extensive experiments prove the effectiveness of the proposed approach in a real-world business context.

What Makes Conversations Interesting? Themes, Participants and Consequences of Conversations in Online Social Media

603 Munmun De Choudhury, Hari Sundaram, Ajita John and Doree Seligmann

Abstract. Rich media social networks promote not only creation and consumption of media, but also communication about the posted media item. What causes a conversation to be interesting, that prompts a user to participate in the discussion on a posted video? We conjecture that people will participate in conversations when they find the conversation theme interesting, see comments by people that are known to them or observe an engaging dialogue between two or more people is engaging (an absorbing back and forth between two people). Importantly, a conversation that is deemed interesting must be consequential Ð i.e. it must impact the social network itself. Our framework has three parts: characterizing themes, characterizing participants for determining interestingness and measures of consequences of a conversation deemed to be interesting. First, we detect conversational themes using a sophisticated mixture model approach. Second, we determine interestingness of participants and interestingness of conversations based on a random walk model. Third, we measure the consequence of a conversation by measuring the mutual information of the interesting property with three variables that should be affected by an interesting conversation Ð participation in related themes, participant cohesiveness and theme diffusion. We have conducted extensive experiments using dataset from the popular video sharing site, YouTube. Our results show that our method of interestingness maximizes the mutual information, and is significantly better (twice as large) than three other baseline methods (number of comments, number of new participants and PageRank based assessment).

Mobile Web

UserIface-1 Rittwik Jana

Mining Interesting Locations and Travel Sequences from GPS Trajectories for Mobile Users

120 Yu Zheng, Lizhu Zhang, Xing Xie and Wei-Ying Ma

Abstract. The increasing availability of GPS-enabled devices is changing the way people interact with the Web, and brings us a large amount of GPS trajectories representing people's location histories. In this paper, based on multiple users' GPS trajectories, we aim to mine interesting locations and classical travel sequences in a given geospatial region. Here, interesting locations mean the culturally important places, such as Tiananmen Square in Beijing, and frequented public areas, like shopping malls and restaurants, etc. Such information can help users understand surrounding locations, and would enable travel recommendation. In this work, we first model multiple individuals' location histories with a tree-based hierarchical graph (TBHG). Second, based on the TBHG, we propose a HITS (Hypertext Induced Topic Search)-based inference model, which regards an individual's access on a location as a directed link from the user to that location. This model infers the interest of a location by taking into account the following three factors. 1) The interest of a location depends on not only the number of users visiting this location but also these users' travel experiences. 2) Users' travel experiences and location interests have a mutual reinforcement relationship. 3) The interest of a location and the travel experience of a user are relative values and are region-related. Third, we mine the classical travel sequences among locations considering the interests of these locations and users' travel experiences. We evaluated our system using a large GPS dataset collected by 107 users over a period of one year in the real world. As a result, our HITS-based inference model outperformed baseline approaches like rank-by-count and rank-by-frequency. Meanwhile, when considering the users' travel experiences and location interests, we achieved a better performance beyond baselines, such as rank-by-count and rank-by-interest, etc.

Computers and iPhones and Mobile Phones, oh my! A logs-based comparison of search users on different devices

70 Maryam Kamvar, Melanie Kellar, Rajan Patel and Ya Xu.

Abstract. We present a logs-based comparison of search patterns across three platforms: computers, iPhones and conventional mobile phones. Our goal is to understand how mobile search users differ from computer-based search users, and we focus heavily on the distribution and variability of tasks that users perform from each platform. The results suggest that search usage is much more focused for the average mobile user than for the average computer-based user. However, search behavior on high-end phones resembles computer-based search behavior more so than mobile search behavior. A wide variety of implications follow from these findings. First, there is no single search interface which is suitable for all mobile phones. We suggest that for the higherend phones, a close integration with the standard computer-based interface (in terms of personalization and available feature set) would be beneficial for the user, since these phones seem to be treated as an extension of the users' computer. For all other phones, there is a huge opportunity for personalizing the search experience for the user's "mobile needs", as these users are likely to repeatedly

A Game Based Approach to Assign Geographical Relevance to Web Images

227 Yuki Arase, Xing Xie, Manni Duan, Takahiro Hara and Shojiro Nishio

Abstract. Geographical context is very important for images. Millions of images on the Web have been already assigned latitude and longitude information. Due to the rapid proliferation of such images with geographical context, it is still difficult to effectively search and browse them, since we do not have ways to decide their relevance. In this paper, we focus on the geographical relevance of images, which is defined as to what extent the main objects in an image match landmarks at the location where the image was taken. Recently, researchers have proposed to use game based approaches to label large scale data such as Web images. However, there is no in-depth study on the quality of collected game logs and how the logs can improve existing applications. To answer these questions, we design and implement a Web-based and multi-player game to collect human knowledge while people are enjoying the game. Then we thoroughly analyze the game logs obtained during a three week study with 147 participants and propose methods to determine the image geographical relevance. In addition, we conduct an experiment to compare our methods with a commercial search engine. Experimental results show that our methods dramatically improve image search relevance. Furthermore, we show that we can derive geographically relevant objects and their salient portion in images, which is valuable for a number of applications such as image location recognition.


Mining-7 Masashi Toyoda

Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums

397 Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang and Wei-Ying Ma

Abstract. Web forum has become an important data resource for many Web applications, but extracting structured data from unstructured Web forum pages is still a challenging task due to both complex page layout designs and unrestricted user created posts. In this paper, we study the problem of structured data extraction from various Web forum sites. Our target is to find a solution as general as possible to extract structured data such as post title, post author, post time, and post content from any forum site. In contrast to most existing information extraction methods which only leverage the knowledge inside an individual page, we incorporate both the page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidences by learning their importance automatically. The site-level knowledge includes (1) the linkages among different object pages such as list pages and post pages, and (2) the interrelationships of pages belonging to one same object. The experimental results on 20 forums show very encouraging performance of information extraction, and demonstrate the generalization ability of the proposed approach on various forums. We also show that the performance is limited if only page-level knowledge is used, while incorporating the site-level knowledge both precision and recall can be significantly improved.

Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs

266 Huanhuan Cao, Daxin Jiang, Jian Pei, Enhong Chen and Hang Li

Abstract. Capturing the context of a user's query from the previous queries and clicks in the same session leads to better understanding of the user's information need. A context-aware approach to document re-ranking, query suggestion, and URL recommendation may improve users' search experience substantially. In this paper, we propose a general approach to context-aware search. To capture contexts of queries, we learn a variable length Hidden Markov Model (vlHMM) from search sessions extracted from log data. Although the mathematical model is intuitive, how to learn a large vlHMM with millions of states from hundreds of millions of search sessions poses a grand challenge. We develop a strategy for parameter initialization in vlHMM learning which can greatly reduce the number of parameters to be estimated in practice. We also devise a method for distributed vlHMM learning under the map-reduce model. We test our approach on a real data set consisting of 1.8 billion queries, 2.6 billion clicks, and 840 million search sessions, and evaluate the effectiveness of the vlHMM learned from the real data on three search applications: document re-ranking, query suggestion, and URL recommendation. The experimental results clearly show that our context- aware approach is both effective and efficient.

A Class-Feature-Centroid Classifier For Text Categorization

301 Guan Hu, Jingyu Zhou and Minyi Guo

Abstract. Automated text categorization is an important technique for many web applications, such as document indexing, document filtering, and cataloging web resources. Many different approaches have been proposed for the automated text categorization problem. Among them, centroid-based approaches have the advantages of short training time and testing time due to its computational efficiency. As a result, centroid-based classifiers have been widely used in many web applications. However, the accuracy of centroid-based classifiers is inferior to SVM, mainly because centroids found during the training process are far from perfect locations.
We design a fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization. In CFC, a centroid is built from two important class features: inter-class term distribution and inner-class term distribution. CFC proposes a novel combination of these features and employs a denormalized cosine measure to calculating the similarity between a text vector and a centroid. Experiments on the Reuters-21578 corpus and 20-newsgroup email collection show that CFC consistently outperforms the state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores. Particularly, CFC is more effective and robust than SVM when the training data is sparse.

Large Scale Multi-Label Classification via MetaLabeler

686 Lei Tang, Suju Rajan and Vijay Narayanan

Abstract. The explosion of online content has made the management of such content non-trivial. Web-related tasks such as web page categorization, news filtering, query categorization, tag recommendation, etc. often involve the construction of multi-label categorization systems on a large scale. Existing multi- label classification methods either do not scale or have unsatisfactory performance. In this work, we propose MetaLabeler to automatically determine the relevant set of labels for each instance without intensive human involvement or expensive cross- validation. Extensive experiments conducted on benchmark data show that the MetaLabeler tends to outperform existing methods. Moreover, MetaLabeler scales to millions of multi-labeled instances and can be deployed easily. This enables us to apply the MetaLabeler to a large scale query categorization problem in Yahoo!, yielding a significant improvement in performance.

Web Privacy

Privacy-1 David Lowe

Collective Privacy Management in Social Networks

667 Aanna Squicciarini, Mohamed Shehab and Federica Paci

Abstract. Social Networking is one of the major technological phenomena of the Web 2.0, with hundreds of millions of people participating. Social networks enable a form of self expression for users, and help them to socialize and share content with other users. In spite of the fact that content sharing represents one of the prominent features of existing Social Network sites, Social Networks yet do not support any mechanism for collaborative management of privacy settings for shared content. In this paper, we model the problem of collaborative enforcement of privacy policies on shared data by using game theory. In particular, we propose a solution that offers automated ways to share images based on an extended notion of content ownership. Building upon the Clarke-Tax mechanism, we describe a simple mechanism that promotes truthfulness, and that rewards users who promote co- ownership. We integrate our design with inference techniques that free the users from the burden of manually selecting privacy preferences for each picture. To the best of our knowledge this is the first time such a protection mechanism for social networking has been proposed. In the paper, we also show a proof-of-concept application, which we implemented in the context of Facebook, one of today's most popular social networks. We show that supporting these type of solutions is not also feasible, but can be implemented through a minimal increase in overhead to end-users.

To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles

660 Elena Zheleva and Lise Getoor

Abstract. In order to address privacy concerns, many social media websites allow users to hide their personal profiles from the public. In this work, we show how an adversary can exploit an online social network with a mixture of public and private user profiles to predict the private attributes of users. We map this problem to a relational classification problem and we propose practical models that use friendship and group membership information (which is often not hidden) to infer sensitive attributes. The key novel idea is that in addition to friendship links, groups can be carriers of significant information. We show that on several well-known social media sites, we can easily and accurately recover the information of private-profile users. To the best of our knowledge, this is the first work that uses link-based and group- based classification to study privacy implications in social networks with mixed public and private user profiles.

Privacy Diffusion on the Web: A Longitudinal Perspective

587 Balachander Krishnamurthy and Craig Wills

Abstract. For the last few years we have been studying the diffusion of privacy for users as they visit various Web sites triggering data gathering aggregation by third parties. This paper reports on our longitudinal study consisting of multiple snapshots of our examination of such diffusion over four years. We examine the various technical ways by which third-party aggregators acquire data and the depth of user-related information acquired. We study techniques for protecting privacy diffusion as well as limitations of such techniques. We introduce the concept of secondary privacy damage.
Our results show increasing aggregation of user-related data by a steadily decreasing number of entities. A handful of companies are able to track users' movement across almost all of the popular Web sites. Virtually all the protection techniques have significant limitations highlighting the seriousness of the problem and the need for alternate solutions.

Tagging and Clustering

RichMedia-2 Doree Seligmann

Visual diversification of image search results

184 Reinier H. van Leuken, Lluís Garcia, Ximena Olivares and Roelof van Zwol

Abstract. Due to the reliance on the textual information associated with an image, image search engines on the Web lack the discriminative power to deliver visually diverse search results. The textual descriptions are key to retrieve relevant results for a given user query, but at the same time provide little information about the rich image content.
In this paper we investigate three methods for visual diversification of image search results. The methods deploy lightweight clustering techniques in combination with a dynamic weighting function of the visual features, to best capture the discriminative aspects of the resulting set of images that is retrieved. A representative image is selected from each cluster, which together form a diverse result set.
Based on a performance evaluation we find that the outcome of the methods closely resembles human perception of diversity, which was established in an extensive clustering experiment carried out by human assessors.

Tag Ranking

306 Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang and Hong-Jiang Zhang

Abstract. Social media sharing web sites like Flickr allow users to annotate images with free tags, which significantly facilitate Web image search and organization. However, the tags associated with an image generally are in a random order without any importance or relevance information, which limits the effectiveness of these tags in search and other applications. In this paper, we propose a tag ranking scheme, aiming to automatically rank the tags associated with a given image according to their relevance to the image content. We first estimate initial relevance scores for the tags based on probability density estimation, and then perform a random walk over a tag similarity graph to refine the relevance scores. Experimental results on a 50, 000 Flickr photo collection show that the proposed tag ranking method is both effective and efficient. We also apply tag ranking into three applications: (1) tag-based image search, (2) tag recommendation, and (3) group recommendation, which demonstrates that the proposed tag ranking approach really boosts the performances of social-tagging related applications.

Learning to Tag

728 Lei Wu, Linjun Yang, Nenghai YU and Xian-Sheng Hua

Abstract. Social tagging provides valuable and crucial information for large-scale web image retrieval. It is ontology-free and easy to obtain; however, noisy tags frequently appear, and users typically will not tag all semantic objects in the image, which is also called semantic loss. To avoid noises and compensate for the semantic loss, tag recommendation is proposed in literature. However, current recommendation simply ranks the related tags based on the single modality of tag co-occurrence on the whole dataset, which ignores other modalities, such as visual correlation. This paper proposes a multi-modality recommendation based on both tag and visual correlation, and formulates the tag recommendation as a learning problem. Each modality is used to generate a ranking feature, and Rankboost algorithm is applied to learn an optimal combination of these ranking features from different modalities. Experiments on Flickr data demonstrate the effectiveness of this learning-based multi-modality recommendation strategy.

User Interfaces

UserIface-2 Robin Chen

Web 2.0: Blind to an Accessible New World

513 Joshua Hailpern, Loretta Guarino Reid, Richard Boardman and Srinivas Annam

Abstract. With the advent of Web 2.0 technologies, Websites have evolved from static pages to dynamic, interactive Web-based applications with the ability to replicate common desktop functionality. However, for blind and visually impaired individuals who rely upon screen readers, Web 2.0 applications force them to adapt to an inaccessible use model. Many technologies, including WAI-ARIA, AJAX, and improved screen reader support, are rapidly coming together. However, simply combining them does not solve the problems of screen reader users. The main contributions of this paper are two models of interaction for screen reader users, for both traditional Websites and Web 2.0 applications. Further contributions are a discussion of accessibility difficulties screen reader users encounter when interacting with Web 2.0 applications, a user workflow design model for improving Web 2.0 accessibility, and a set of design requirements for developers to ease the user's burden and increase accessibility. These models, accessibility difficulties, and design implications are based directly on responses and lessons learned from usability research focusing on Web 2.0 usage and screen reader users. Without the conscious effort of Web engineers and designers, most blind and visually impaired users will shy away from using new Web 2.0 technology in favor of desktop based applications.

Scrolling Behaviour with Single- and Multi-column Layout

554 Cameron Braganza, Kim Marriott, Peter Moulder, Michael Wybrow and Tim Dwyer

Abstract. The standard layout model used by web browsers is to lay text out in a vertical scroll using a single column. The horizontal-scroll layout model - in which text is laid out in columns whose height is set to that of the browser and the viewer scrolls horizontally - seems well-suited to multi-column layout on electronic devices. We describe a study that examines how people read and in particular the strategies they use for scrolling with these two models when reading large textual documents on a standard computer monitor. We compare usability of the models and evaluate both user preferences and the effect of the model on performance. Also interesting is the description of the browser and its user interface which we used for the study.

What's Up CAPTCHA? A CAPTCHA Based on Image Orientation

97 Rich Gossweiler, Maryam Kamvar and Shumeet Baluja

Abstract. We present a new CAPTCHA where users adjust randomly rotated images to their upright orientation. Successfully determining the upright orientation requires the ability to analyze the complex contents of an image; a task that humans usually perform well and machines generally perform poorly. Given a large repository of images, such as those from a web search result, we use a suite of automated orientation detectors to prune those that can be easily set upright. We then apply a social feedback mechanism to verify that the remaining images have a humanrecognizable upright orientation. The main advantages of our CAPTCHA technique over the traditional text recognition technique are that it is language-independent, does not require text-entry (e.g. for a mobile device) and provides another domain for CAPTCHA generation beyond character obfuscation. This CAPTCHA lends itself to rapid implementation and has an almost limitless supply of images. We conducted extensive experiments regarding the viability of this technique and achieve positive results.

Web Mining

Mining-6 Dunja Mladenic

Mining the Web to Facilitate Fast and Accurate Approximate Match

605 Surajit Chaudhuri, Venkatesh Ganti and Dong Xin

Abstract. Tasks relying on recognizing entities have recently received significant attention in the literature. Many such tasks assume the existence of reference entity tables. In this paper, we consider the problem of determining whether a candidate string approximately matches with a reference entity. This problem is important for extracting named entities such as products or locations from a reference entity table, or matching entity entries across heterogenous sources. Prior approaches have relied on string-based similarity which only compare a candidate string and an entity it matches with. In this paper, we observe that considering such evidence across multiple documents significantly improves the accuracy of matching. We develop efficient techniques which exploit web search engines to facilitate approximate matching in the context of our proposed similarity functions. In an extensive experimental evaluation, we demonstrate the accuracy and efficiency of our techniques.

SmartMiner: A New Framework for Mining Large Scale Web Usage Data

618 Murat Ali Bayir, Ismail Toroslu, Ahmet Cosar and Guven Fidan

Abstract. In this paper, we propose a novel framework called Smart- Miner for web usage mining problem which uses link information for producing accurate user sessions and frequent navigation patterns. Unlike the simple session concepts in the time and navigation based approaches, where sessions are sequences of web pages requested from the server or viewed in the browser, in Smart-Miner sessions are set of paths traversed in the web graph that corresponds to users' navigations among web pages. We have modeled session reconstruction as a new graph problem and utilized a new algorithm, Smart-SRA, to solve this problem efficiently. For the pattern discovery phase, we have developed an efficient version of the Apriori-All technique which uses the structure of web graph to increase the performance. From the experiments that we have performed on both real and simulated data, we have observed that Smart-Miner produces at least 30%more accurateweb usage patterns than other approaches including previous session construction methods. We have also studied the effect of having the referrer information in the web server logs to show that different versions of Smart-SRA produce similar results. Another novel work is that we have implemented distributed version of the Smart Miner framework by employing Map-Reduce paradigm which enables processing huge size web server logs belonging to multiple web sites. To the best of our knowledge this paper is the first attempt to propose such large scale framework forweb usage mining problem. We conclude that we can efficiently process terabytes of web server logs belonging to multiple web sites by employing our scalable framework.

Releasing Search Queries and Clicks Privately

772 Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra and Alexandros Ntoulas

Abstract. The question of how to publish an anonymized search log was brought to the forefront by a well-intentioned, but privacy-unaware AOL search log release. Since then a series of ad-hoc techniques have been proposed in the literature, though none are known to be provably private. In this paper, we take a major step towards a solution: we show how queries, clicks and their associated perturbed counts can be published in a manner that rigorously preserves privacy. Our algorithm is decidedly simple to state, but non-trivial to analyze. On the opposite side of privacy is the question of whether the data we can safely publish is of any use. Our findings offer a glimmer of hope: we demonstrate that a non-negligible fraction of queries and clicks can indeed be safely published via a collection of experiments on a real search log. In addition, we select an application, finding similar queries, and show that the similar queries discovered on the original data resemble those found on the perturbed data.

Caching and Indices

Search-3 Marc Najork

Improved Techniques for Result Caching in Web Search Engines

676 Qingqing Gan and Torsten Suel

Abstract. Query processing is a major cost factor in operating large web search engines. In this paper, we study query result caching, one of the main techniques used to optimize query processing performance. Our first contribution is a study of result caching as a weighted caching problem in depth. Most previous work has focused on optimizing cache hit rates, but given that processing costs of queries can vary very significantly we argue that total cost savings also need to be considered. We describe and evaluate several algorithms for weighted result caching, and study the impact of Zipf-based query distributions on result caching. Our second and main contribution is a new set of feature-based cache eviction policies that achieve significant improvements over all previous methods, substantially narrowing the existing performance gap to the theoretically optimal (clairvoyant) method. Finally, using the same approach, we also obtain performance gains for the related problem of inverted list caching.

Nearest-Neighbor Caching for Content-Match Applications

528 Andrei Broder, Flavio Chierichetti, Vanja Josifovski, Ravi Kumar, Sandeep Pandey and Sergei Vassilvitskii

Abstract. Motivated by contextual advertising systems and other web applications involving efficiency-accuracy tradeoffs, we study similarity caching. Here, a cache hit is said to occur if the requested item is similar but not necessarily equal to some cached item. We study two objectives that dictate the efficiency-accuracy tradeoff and provide our caching policies for these objectives. By conducting extensive experiments on real data we show similarity caching can significantly improve the efficiency of contextual advertising systems, with minimal impact on accuracy. Inspired by the above, we propose a simple generative model that embodies two fundamental characteristics of page requests arriving to advertising systems, namely, long-range dependences and similarities. We provide theoretical bounds on the gains of similarity caching in this model and demonstrate these gains empirically by fitting the actual data to the model.

Compressed web indexes

511 Flavio Chierichetti, Ravi Kumar and Prabhakar Raghavan

Abstract. Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or similar power laws), since these are the distributions that arise on the web.
We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that in reality, the distributions of terms appear to follow a double-Pareto law rather than Zipf's law. Index sizes observed in the experiments conform well to our theoretical predictions.

Web Monetization

Moneti-2 David Stern

Adaptive Bidding for Display Advertising

632 Arpita Ghosh, Benjamin Rubinstein, Sergei Vassilvitskii and Martin Zinkevich

Abstract. Motivated by the emergence of auction-based marketplaces for display ads such as the Right Media Exchange, we study the design of a bidding agent that implements a display advertising campaign by bidding in such a marketplace. The bidding agent must acquire a given number of impressions with a given target spend, when the highest external bid in the marketplace is drawn from an {\em unknown} distribution $\cP$. The quantity and spend constraints arise from the fact that display ads are usually sold on a CPM basis. We consider both the full information setting, where the winning price in each auction is announced publicly, and the partially observable setting where only the winner obtains information about the distribution; these differ in the penalty incurred by the agent while attempting to learn the distribution. We provide algorithms for both settings, and prove performance guarantees using bounds on uniform closeness from statistics, and techniques from online learning. We experimentally evaluate these algorithms: both algorithms perform very well with respect to both target quantity and spend; further, our algorithm for the partially observable case performs nearly as well as that for the fully observable setting despite the higher penalty incurred during learning.

How much the Behavioral Targeting can Help Online Advertising?

537 Jun Yan, Ning Liu, Gang Wang, Wen Zhang, Yun Jiang and Zheng Chen

Abstract. Behavioral Targeting (BT) attempts to deliver the most relevant advertisements to the most interested audiences, and is playing an increasingly important role in online advertising market. However, there have been not any public works investigating on how much the BT can truly help online advertising in commercial search engines? To answer this question, in this paper we provide an empirical study on the ads click-through log collected from a commercial search engine. From the comprehensively experimental results on the sponsored search log of a commercial search engine over a period of seven days, we can draw three important conclusions: (1) Users who clicked the same ad will truly have similar behaviors on the Web; (2) The Click-Through Rate (CTR) of an ad can be averagely improved as high as 670% by properly segmenting users for behavioral targeted advertising; (3) Using the short term user behaviors to represent users is more effective than using the long term user behaviors for BT. The statistical t-test verifies that all conclusions drawn in the paper are statistically significant. To our best knowledge, this work is the first empirical study for BT on real world ads click- through log in academia.

Web Service Derivatives

752 Thomas Meinl and Benjamin Blau

Abstract. Abstract: Web service development and usage has shifted from simple information processing services to high-value business services that are crucial to productivity and success. In order to deal with an increasing risk of unavailability or failure of mission-critical Web services we argue the need for advanced reservation of services in the form of derivatives.
The contribution of this paper is twofold: First we provide an abstract model of a market design that enables the trade of derivatives for mission-critical Web services. Our model satisfies requirements that result from service characteristics such as intangibility and the impossibility to inventor services in order to meet fluctuating demand. It comprehends principles from models of incomplete markets such as the absence of a tradeable underlying and consistent arbitrage-free derivative pricing.
Furthermore we provide an architecture for a Web service market that implements our model and describes the strategy space and interaction of market participants in the trading process of service derivatives. We compare the underlying pricing processes to existing derivative models in energy exchanges, discuss eventual shortcomings, and apply Wavelets to analyze actual data and extract long- and short-term trends.

Semantic Data Management

Seman-1 Eduardo Mena

Rapid Semantic Web Mashup Development through Semantic Web Pipes

160 Danh Le Phuoc, Axel Polleres, Giovanni Tummarello, Christian Morbidoni and Manfred Hauswirth

Abstract. The use of RDF data published on the Web for applications is still a cumbersome and resource-intensive task due to the limited software support and the lack of standard programming paradigms to deal with everyday problems such as combination of RDF data from di erent sources, object identi er consolidation, ontology alignment and mediation or plain querying and processing tasks. While in a lot of other areas such tasks are supported by excellent libraries and component- oriented toolboxes of basic processing functionalities, RDF-basedWeb applications are still largely customized programs for a speci c purpose, with little potential for reuse. This increases development costs and incurs a more error-prone development process. Speaking in software engineering terms, this means that a good standard architectural style with good support for rapid application development is still missing. In this paper we present a framework based on the classical abstraction of pipes which tries to remedy this problem and support the fast implementation of software, while preserving desirable properties such as abstraction, encapsulation, component-orientation, code re-usability and maintainability, which are common and well supported in other application areas.

idMesh: Graph-Based Disambiguation of Linked Data

701 Philippe Cudre-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer and Hermann de Meer

Abstract. We tackle the problem of disambiguating entities on the Web. We propose a user-driven scheme where graphs of entities -- represented by globally identifiable declarative artifacts -- self-organize in a dynamic and probabilistic manner. Our solution has the following two desirable properties: i) it lets end-users freely define associations between arbitrary entities and ii) it probabilistically infers entity relationships based on uncertain links using constraint-satisfaction mechanisms. We outline the interface between our scheme and the current data Web, and show how higher-layer applications can take advantage of our approach to enhance search and update of information relating to online entities. We describe a decentralized infrastructure supporting efficient and scalable entity disambiguation and demonstrate the practicability of our approach in a deployment over several hundreds of machines.

OpenRuleBench: An Analysis of the Performance of Rule Engines

31 Senlin Liang, Paul Fodor, Hui Wan and Michael Kifer

Abstract. The Semantic Web initiative has led to an upsurge of the interest in rules as a general and powerful way of processing, combining, and analyzing semantic information. Since several of the technologies underlying rule-based systems are already quite mature, it is important to understand how such systems might perform on the Web scale. OpenRuleBench is a suite of benchmarks for analyzing the performance and scalability of different rule engines. Currently the study spans five different technologies and eleven systems, but OpenRuleBench is an open community resource, and contributions from the community are welcome. In this paper, we describe the tested systems and technologies, the methodology used in testing, and analyze the results.


Perfor-1 Arun Iyengar

Efficient Application Placement in a Dynamic Hosting Platform

194 Zakaria Al-Qudah, Hussein Alzoubi, Mark Allman, Michael Rabinovich and Vincenzo Liberatore

Abstract. Web hosting providers are increasingly looking into dynamic hosting to reduce costs and improve the performance of their platforms. Instead of provisioning fixed resources to each customer, dynamic hosting maintains a variable number of application instances to satisfy current demand. While existing research in this area has mostly focused on the algorithms that decide on the number and location of application instances, we address the problem of efficient enactment of these decisions once they are made. We propose a new approach to application placement and experimentally show that it dramatically reduces the cost of application placement, which in turn improves the end-to-end agility of the hosting platform in reacting to demand changes.

Network Aware Forward Caching

650 Jeffrey Erman, Alexandre Gerber, Oliver Spatscheck, Dan Pei and MohammadTaghi Hajiaghayi

Abstract. This paper proposes and evaluates a Network Aware Forward Caching approach for determining the optimal deployment strategy of forward caches to a network. A key advantage of this approach is that we can reduce the costs associated with forward caching to maximize the benefit obtained from their deployment. We show in our simulation that a 37% increase to net benefits could be achieved over the standard method of full cache deployment to cache all POPs traffic. In addition, we show that this maximal point occurs when only 68% of the total traffic is cached. Another contribution of this paper is the analysis we use to motive and evaluate this problem. We characterize the Internet traffic of 100K subscribers of a US residential broadband provider. We use both layer 4 and layer 7 analysis to investigate the traffic volumes of the flows as well as study the general characteristics of the applications used. We show that HTTP is a dominant protocol and account for 68% of the total downstream traffic and that 34% of that traffic is multimedia. In addition, we show that multimedia content using HTTP exhibits a 83% annualized growth rate and other HTTP traffic has a 53% growth rate versus the 26% over all annual growth rate of broadband traffic. This shows that HTTP traffic will become ever more dominent and increase the potential caching opportunities. Furthermore, we characterize the core backbone traffic of this broadband provider to measure the efficiency content and traffic is delivered. We find that CDN traffic is much more efficient than P2P content and that there is large skew in the Air Miles between POP in a typical network. Our findings show that there are many opportunties in broadband provider networks to optimize how traffic is delivered and cached.

Anycast-Aware Transport for Content Delivery Networks

680 Zakaria Al-Qudah, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck and Kobus van der Merwe

Abstract. Anycast-based content delivery networks (CDNs) have many properties that make them ideal for the large scale distribution of content on the Internet. However, because routing changes can result in a change of the endpoint that terminates the TCP session, TCP session disruption remains a concern for anycast CDNs, especially for large file down-loads. In this paper we demonstrate that this problem does not require any complex solutions. In particular, we present the design of a simple, yet efficient, mechanism to handle session disruptions due to endpoint changes. With our mechanism, a client can continue the download of the content from the point at which it was before the endpoint change. Furthermore, CDN servers purge the TCP connection state quickly to handle frequent switching with low system overhead. We demonstrate experimentally the effectiveness of our proposed mechanism and show that more complex mechanisms are not required. Specifically, we find that our mechanism maintains high download throughput even with a reasonably high rate of endpoint switching, which is attractive for load balancing scenarios. Moreover, our results show that edge servers can purge TCP connection state after a single timeout-triggered retransmission without any tangible impact on ongoing connections. Besides improving server performance, this behavior improves the resiliency of the CDN to certain denial of service attacks.

Query Categorization

Search-4 Evgeniy Gabrilovich

Unsupervised Query Categorization using Automatically-Built Concept Graphs

341 Eustache Diemert and Gilles Vandelle

Abstract. Automatic categorization of user queries is an important component of general purpose (Web) search engines, particularly for triggering rich, query- specific content and sponsored links. We propose an unsupervised learning scheme that reduces dramatically the cost of setting up and maintaining such a categorizer, while retaining good categorization power. The model is stored as a graph of concepts where graph edges represent the cross-reference between the concepts. Concepts and relations are extracted from query logs by an offline Web mining process, which uses a search engine as a powerful summarizer for building a concept graph. Empirical evaluation indicates that the system compares favorably on publicly available data sets (such as KDD Cup 2005) as well as on portions of the current query stream of Yahoo! Search, where it is already changing the experience of millions of Web search users.

Understanding User's Query Intent with Wikipedia

371 Jian Hu, gang wang, Fred Lochovsky and Zheng Chen

Abstract. Understanding the intent behind a user's query can help search engine to automatically route the query to some corresponding vertical search engines to obtain particularly relevant contents, thus, greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) Intent representation; (2) Domain coverage and (3) Semantic interpretation. Current approaches to predict the user's intent mainly utilize machine learning techniques. However, it is difficult and often requires much human efforts to meet all these challenges by the statistical machine learning approaches. In this paper, we propose a general methodology to the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge base. The Wikipedia concepts are used as the intent representation space, thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified through mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method can achieve much better coverage to classify queries in an intent domain even through the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform the quantitative evaluations in comparison with two baseline methods, and the experimental results shows that our method significantly outperforms other methods in each intent domain.

Discover Users' Specific Geo Intention in Web Search

197 Xing Yi, Hema Raghavan and Chris Leggetter

Abstract. Discovering users' specific and implicit geographic intention in web search can greatly help satisfy users' information needs. We build a geo intent analysis system that learns a model from large scale web-search logs for this discovery with minimal supervision. We build a city language model, which is a probabilistic representation of the language surrounding the mention of a city in web queries. We use several features derived from these language models to: (1) identify users' implicit geo intent and pinpoint the city corresponding to this intent (2)determine whether the geo-intent is localized around the users' current geographic location. (3) predict cities for queries that have a mention of an entity that is located in a specific place. Experimental results demonstrate the effectiveness of using features derived from the city language model. We find that (1) the system has over 90% precision and more than 74% accuracy for the task of detecting users' implicit city level geo intent. (2) the system achieves more than 96% accuracy in determining whether implicit geo queries are local geo queries, neighbor region geo queries or none-of these (3) the city language model can effectively retrieve cities in location-specific queries with high precision(88%) and recall(74%); human evaluation results show that the language model predicts city labels for location-specific queries with high accuracy (84.5%).

Sponsored Search

Moneti-1 Andrei Broder

Hybrid Keyword Search Auctions

238 Ashish Goel and Kamesh Munagala

Abstract. Search auctions have become a dominant source of revenue generation on the Internet. Such auctions have typically used per-click bidding and pricing. We propose the use of hybrid auctions where an advertiser can make a per-impression as well as a per-click bid, and the auctioneer then chooses one of the two as the pricing mechanism. We assume that the advertiser and the auctioneer both have separate beliefs (called priors) on the click-probability of an advertisement. We first prove that the hybrid auction is truthful, assuming that the advertisers are risk- neutral. We then show that this auction is superior to the existing per-click auction in multiple ways:
1) We show that risk-seeking advertisers will choose only a per-impression bid whereas risk-averse advertisers will choose only a per-click bid, and argue that both kind of advertisers arise naturally. Hence, the ability to bid in a hybrid fashion is important to account for the risk characteristics of the advertisers.
2) For obscure keywords, the auctioneer is unlikely to have a very sharp prior on the click-probabilities. In such situations, we show that having the extra information from the advertisers in the form of a per-impression bid can result in significantly higher revenue.
3) An advertiser who believes that its click-probability is much higher than the auctioneer's estimate can use per-impression bids to correct the auctioneer's prior without incurring any extra cost.
4) The hybrid auction can allow the advertiser and auctioneer to implement complex dynamic programming strategies to deal with the uncertainty in the click- probability using the same basic auction. The per-click and per-impression bidding schemes can only be used to implement two extreme cases of these strategies.
As Internet commerce matures, we need more sophisticated pricing models to exploit all the information held by each of the participants. We believe that hybrid auctions could be an important step in this direction. The hybrid auction easily extends to multiple slots, and is also applicable to scenarios where the hybrid bidding is per-impression and per-action (i.e. CPM and CPA), or per-click and per- action (i.e. CPC and CPA).

Bid Optimization for Broad Match Ad Auctions

684 Vahab Mirrokni, Eyal Even-Dar, Yishay Mansour, S Muthukrishnan and Uri Nadav

Abstract. Ad auctions support the ``broad match'' that allows an advertiser to target a large number of queries by bidding only on a limited number of queries as broad match. While giving more expressiveness to advertisers, this feature makes it challenging for the advertisers to find bidding strategies to optimize their returns: choosing to bid on a query as a broad match because it provides high profit results in one bidding for related queries which may yield low or even negative profits.
We abstract and study the complexity of the {\em bid optimization problem} which is to determine an advertiser's bids on a subset of keywords (possibly using broad match) so that her profit is maximized. In the query language model when the advertiser is allowed to bid on all queries as broad match, we present an linear programming (LP)-based polynomial-time algorithm for the bid optimization problem. In the model in which an advertiser can only bid on keywords, ie., a subset of keywords as an exact or broad match, we show that this problem is not approximable within any reasonable approximation factor unless P=NP. To deal with this hardness result, we present a constant-factor approximation when the optimal profit significantly exceeds the cost. This algorithm is based on rounding a natural LP formulation of the problem. Finally, we study a budgeted variant of the problem, and show that in the query language model, one can find two budget constrained ad campaigns in polynomial time that implement the optimal bidding strategy. Our results are the first to address bid optimization under the broad match feature which is common in ad auctions.

General Auction Mechanism for Search Advertising

388 Gagan Aggarwal, S Muthukrishnan, David Pal and Martin Pal

Abstract. In sponsored search, a number of advertising slots is available on a search results page, and have to be allocated among a set of advertisers competing to display an ad on the page. This gives rise to a bipartite matching market that is typically cleared by the way of an automated auction. Several auction mechanisms have been proposed, with variants of the Generalized Second Price (GSP) being widely used in practice.
There is a rich body of work on bipartite matching markets that builds upon the stable marriage model of Gale and Shapley and the assignment model of Shapley and Shubik. This line of research offers deep insights into the structure of stable outcomes in such markets and their incentive properties.
In this paper, we model advertising auctions in terms of an assignment model with linear utilities, extended with bidder and item specific maximum and minimum prices. Auction mechanisms like the commonly used GSP or the well-known Vickrey-Clarke-Groves (VCG) can be interpreted as simply computing a \emph{bidder-optimal stable matching} in this model, for a suitably defined set of bidder preferences, but our model includes much richer bidders and preferences. We prove that in our model the existence of a stable matching is guaranteed, and under a non-degeneracy assumption a bidder-optimal stable matching exists as well. We give a fast algorithm to find such matching in polynomial time, and use it to design truthful mechanism that generalizes GSP, is truthful for profit-maximizing bidders, correctly implements features like bidder-specific minimum prices and position-specific bids, and works for rich mixtures of bidders and preferences. Our main technical contributions are the existence of bidder-optimal matchings and (group) strategyproofness of the resulting mechanism, and are proved by induction on the progress of the matching algorithm.

Linked Data

Seman-2 Ivan Herman

Large Scale Integration of Senses for the Semantic Web

525 Jorge Gracia, Mathieu d'Aquin and Eduardo Mena

Abstract. Nowadays, the increasing amount of semantic data available on the Web leads to a new stage in the potential of Semantic Web applications. However, it also introduces new issues due to the heterogeneity of the available semantic resources. One of the most remarkable is redundancy, that is, the excess of different semantic descriptions, coming from different sources, to describe the same intended meaning.
In this paper, we propose a technique to perform a large scale integration of senses (expressed as ontology terms), in order to cluster the most similar ones, when indexing large amounts of online semantic information. It can dramatically reduce the redundancy problem on the current Semantic Web. In order to make this objective feasible, we have studied the adaptability and scalability of our previous work on sense integration, to be translated to the much larger scenario of the Semantic Web. Our evaluation shows a good behaviour of these techniques when used in large scale experiments, then making feasible the proposed approach.

Triplify - Light-weight Linked Data Publication from Relational Databases

1 Sören Auer, Sebastian Dietzold, Jens Lehmann, Sebastian Hellmann and David Aumueller

Abstract. We present Triplify - a simplistic but effective approach to publish linked data from relational databases. Triplify is based on mapping HTTP-URI requests onto relational database queries. Triplify transforms the resulting relations into RDF statements and publishes the data on the Web in various RDF serializations, in particular as Linked Data. The rationale for developing Triplify is that the largest part of information on the Web is already stored in structured form, often as data contained in relational databases but published by Web applications merely as HTML mixing structure, layout and content. In order to reveal the pure structured information behind the current Web we implemented Triplify as a light-weight software component, which can be easily integrated and deployed with the numerous widely installed Web applications. Our approach includes a method for publishing update logs to enable incremental crawling of linked data sources. Triplify is complemented by a library of configurations for common relational schemata and a REST-enabled datasource registry. Triplify configurations are provided containing mappings for many popular Web applications, including Wordpress, Drupal, Joomla, Gallery, and phpBB. We show that despite its light-weight architecture Triplify is usable to publish very large datasets, such as 160GB of geo data from the OpenStreetMap project.

SOFIE: Self-Organizing Flexible Information Extraction

60 Fabian M. Suchanek, Mauro Sozio and Gerhard Weikum

Abstract. This paper presents SOFIE, a system that can extend an existing ontology by new facts. SOFIE provides a integrative framework, in which information extraction, word disambiguation and semantic reasoning all become part of one unifying model. SOFIE processes text or Web sources and finds meaningful patterns. It maps the words in the pattern to entities in the ontology. It hypothesizes on the meaning of the pattern, and checks the semantic plausibility of the hypothesis with the existing ontology. Then the new fact is added to the ontology, avoiding inconsistency with the existing facts. The logical model that connects existing facts, new hypotheses, extraction patterns, and consistency constraints is represented as a set of propositional clauses. We use an approximation algorithm for the Weighted MAX SAT problem to compute the most plausible subset of hypotheses. Thereby, the SOFIE framework integrates the paradigms of pattern matching, entity disambiguation, and ontological reasoning into one unified model, and enables the automated growth of large ontologies. Experiments, using the YAGO ontology as existing knowledge and various text and Web corpora as input sources, show that our method yields very good precision around 90 percent or higher.

Ads and Query Expansion

Search-5 Fabrizio Silvestri

A Search-based Method for Forecasting Ad Impression in Contextual Advertising

646 Xuerui Wang, Andrei Broder, Marcus Fontoura and Vanja Josifovski

Abstract. Contextual advertising} refers to the placement of small textual ads within the content of a generic web page. It has become a significant source of revenue for publishers ranging from individual bloggers to major newspapers. At the same time it is an important way for advertisers to reach their intended audience. This reach depends on the total number of exposures of the ad (impressions) and its click-through-rate (CTR) that can be viewed as the probability of an end-user clicking on the ad when shown. These two orthogonal, critical factors are both difficult to estimate and even individually can still be very informative in planning and budgeting advertising campaigns.
In this paper, we address the problem of forecasting the number of impressions for new or changed ads in the system. Producing such forecasts, even within large margins of error, is quite challenging:1) ad selection in contextual advertising is a complicated process based on tens or even hundreds of page and ad features; 2) the publishers' content and traffic vary over time; and 3) the scale of the problem is daunting: over a course of a week it involves billions of impressions, hundreds of millions of distinct pages, hundreds of millions of ads, and varying bids of other competing advertisers. We tackle these complexities by simulating the presence of a given ad with its associated bid over weeks of historical data. We obtain an impression estimate by counting how many times the ad would have been displayed if it were in the system over that period of time. We estimate this count by an efficient two-level search algorithm over the distinct pages in the data set. Experimental results show that our approach can accurately forecast the expected number of impressions of contextual ads in real time. We also show how this method can be used in tools for bid selection and ad evaluation.

Exploiting Web Search Engines to Search Structured Information Sources

764 Sanjay Agrawal, Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Arnd Konig and Dong Xin

Abstract. Web search engines leverage information from structured databases to answer queries. For example, many product related queries on search engines ({\em Amazon, Google, Yahoo, Live Search}) are answered by searching underlying product databases containing names, descriptions, specifications, and reviews of products. However, these vertical search engines are ``{\em silo-ed}'' in that their results are independent of those from web search. This often leads to empty or incomplete results, as query terms are matched against the information in the underlying database only. In order to overcome this problem, we propose an approach that first identifies relationships between web documents and items in structured databases. This allows us to subsequently exploit results from web search engines in combination with these relationships to obtain the structured data items relevant for a much wider range of queries. We propose an architecture that implements the integrated search functionality efficiently, adding very little additional overhead to query processing and is fully integrated with the search engine architecture. We demonstrate the quality of our techniques through an extensive experimental study.

Online Expansion of Rare Queries for Sponsored Search

623 Andrei Broder, Peter Ciccolo, Evgeniy Gabrilovich, Vanja Josifovski, Donald Metzler, Lance Riedel and Jeffrey Yuan

Abstract. Sponsored search systems are tasked with matching queries to relevant advertisements. The current state-of-the-art matching algorithms expand the user's query using a variety of external resources, such as Web search results. While these expansion-based algorithms are highly effective, they are largely inefficient and cannot be applied in real-time. In practice, such algorithms are applied offline to popular queries, with the results of the expensive operations cached for fast access at query time. In this paper, we describe an efficient and effective approach for matching ads against rare queries that were not processed offline. The approach builds an expanded query representation by leveraging offline processing done for related popular queries. Our experimental results show that our approach significantly improves the effectiveness of advertising on rare queries with only a negligible increase in computational cost.

Mining for Semantics

Seman-3 Axel Polleres

Emergent Semantics of Social Tagging

733 Ben Markines, Ciro Cattuto, Filippo Menczer, Dominik Benz, Andreas Hotho and Gerd Stumme

Abstract. Social bookmarking systems are becoming increasingly important data sources for bootstrapping and maintaining Semantic Web applications. Their emergent information structures have become known as folksonomies. A key question for harvesting semantics from these systems is how to extend and adapt traditional notions of similarity to folksonomies, and which measures are best suited for applications such as community detection, navigation support, semantic search, user profiling and ontology learning. Here we build an evaluation framework to compare various general folksonomy-based similarity measures, which are derived from several established information-theoretic, statistical, and practical measures. Our framework deals generally and symmetrically with users, tags, and resources. For evaluation purposes we focus on similarity between tags and between resources and consider different methods to aggregate annotations across users. After comparing the ability of several tag similarity measures to predict user-created tag relations, we provide an external grounding by user-validated semantic proxies based on WordNet and the Open Directory Project. We also investigate the issue of scalability. We find that mutual information with distributional micro-aggregation across users yields the highest accuracy, but is not scalable; per-user projection with collaborative aggregation provides the best scalable approach via incremental computations. The results are consistent across resource and tag similarity.

Measuring the Similarity between Implicit Semantic Relations from the Web

373 Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka

Abstract. Measuring the similarity between semantic relations that hold among entities is an important and necessary step in various Web related tasks such as relation extraction, information retrieval and analogy detection. For example, consider the case in which a person knows a pair of entities (e.g. \textit{Google, YouTube}), between which a particular relation holds (e.g. acquisition). The person is interested in retrieving other_such pairs with similar relations (e.g. \textit{Microsoft, Powerset}). Existing keyword-based search engines cannot be applied directly in this case because, in keyword-based search, the goal is to retrieve documents that are relevant to the words used in a query -- not necessarily to the relations implied by a pair of words. We propose a relational similarity measure, using a Web search engine, to compute the similarity between semantic relations implied by two pairs of words. Our method has three components: representing the various semantic relations that exist between a pair of words using automatically extracted lexical patterns, clustering the extracted lexical patterns to identify the different patterns that express a particular semantic relation, and measuring the similarity between semantic relations using a metric learning approach. We evaluate the proposed method in two tasks: classifying semantic relations between named entities, and solving word-analogy questions. The proposed method outperforms all baselines in a relation classification task with a statistically significant average precision score of $0.74$. Moreover, it reduces the time take by Latent Relational Analysis to process $374$ word-analogy questions from $9$ days to less than $6$ hours, with a SAT score of $51\%$.

Extracting Key Terms From Noisy and Multitheme Documents

366 Maria Grineva, Dmitry Lizorkin and Maxim Grinev

Abstract. We present a novel method for key term extraction from text documents. In our method, document is modeled as a graph of semantic relationships between terms of that document. We exploit the following remarkable feature of the graph: the terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs or communities, while non- important terms fall into weakly intercon-nected communities, or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms. We introduce a criterion function to select groups that contain key terms discarding groups with unimportant terms. To weight terms and determine semantic relatedness between them we exploit information extracted from Wikipedia. Using such an approach gives us the following two advantages. First, it allows effectively processing multi-theme documents. Second, it is good at filtering out noise information in the document, such as, for example, navigational bars or headers in web pages. Evaluations of the method show that it outperforms existing methods producing key terms with higher precision and recall. Additional experiments on web pages prove that our method appears to be substantially more e ective on noisy and multi- theme documents than existing methods.