新規登録 | ログイン | FAQ      [?] 
CiteULike is a free online bibliography manager. Register and you can start organising your references online.
Recent | Unread | Search | Authors | Tags | Export

The discoverability of the web

by: Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep Pandey, Andrew Tomkins
(2007), pp. 421-430.


View FullText article


X Reviews [Write a review of this article]

There are no reviews of this article

X Find related articles from these CiteULike users

X Find related articles with these CiteULike tags

X Abstract

Previous studies have highlighted the high arrival rate of new contenton the web. We study the extent to which this new content can beefficiently discovered by a crawler. Our study has two parts. First,we study the inherent difficulty of the discovery problem using amaximum cover formulation, under an assumption of perfect estimates oflikely sources of links to new content. Second, we relax thisassumption and study a more realistic setting in which algorithms mustuse historical statistics to estimate which pages are most likely toyield links to new content. We recommend a simple algorithm thatperforms comparably to all approaches we consider.We measure the emphoverhead of discovering new content, defined asthe average number of fetches required to discover one new page. Weshow first that with perfect foreknowledge of where to explore forlinks to new content, it is possible to discover 90% of all newcontent with under 3% overhead, and 100% of new content with 9%overhead. But actual algorithms, which do not have access to perfectforeknowledge, face a more difficult task: one quarter of new contentis simply not amenable to efficient discovery. Of the remaining threequarters, 80% of new content during a given week may be discoveredwith 160% overhead if content is recrawled fully on a monthly basis.


X BibTeX record

X RIS record



RIS BibTeX
CiteULike organises scholarly (or academic) papers or literature and provides bibliographic (which means it makes bibliographies) for universities and higher education establishments. It helps undergraduates and postgraduates. People studying for PhDs or in postdoctoral (postdoc) positions. The service is similar in scope to EndNote or RefWorks or any other reference manager like BibTeX, but it is a social bookmarking service for scientists and humanities researchers.