Browsing by Author "Yates, Ricardo Baeza"
Now showing 1 - 5 of 5
Item Genealogical trees on the web : a search engine user perspective. (2008) Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents an extensive study of the evolution of textual content on the Web, which shows how some new pages are created from scratch while others are created using already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of a Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study these components in detail, characterizing the copies and identifying the relation between a source of content and a search engine, by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are more frequently returned by queries and more clicked than other documents.

Item Um novo retrato da web brasileira [A new portrait of the Brazilian Web]. (2005) Modesto, Marco; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio; Castilho, Carlos; Yates, Ricardo Baeza
The goal of this article is to evaluate quantitative and qualitative characteristics of the Brazilian Web, comparing current estimates with estimates obtained five years ago. A large part of Web content is dynamic and volatile, which makes it infeasible to collect in its entirety. The evaluation was therefore carried out on a sample of the Brazilian Web collected in March 2005. The results are estimated consistently, using an effective methodology already used in similar studies of the Webs of other countries. Among the main aspects observed in this work are the distribution of page languages, the use of open versus proprietary tools for generating dynamic pages, the distribution of document formats, the distribution of domain types, and the distribution of links to external Web sites.

Item The evolution of web content and search engines. (2006) Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
The Web grows at a fast pace and little is known about how new content is generated. The objective of this paper is to study the dynamics of content evolution in the Web, giving answers to questions like: How much new content has evolved from old Web content? How much of the Web content is biased by the ranking algorithms of search engines? We used four snapshots of the Chilean Web containing documents of all the Chilean primary domains, crawled in four distinct periods of time. If a page in a newer snapshot has content of a page in an older snapshot, we say that the source is a parent of the new page. Our hypothesis is that when pages have parents, for a portion of those pages there was a query that related the parents and made the creation of the new page possible. Thus, part of the Web content is biased by the ranking function of search engines. We also define a genealogical tree for the Web, where many pages are new and do not have parents and others have one or more parents. We present the Chilean Web genealogical tree and study its components. To the best of our knowledge this is the first paper that studies how old content is used to create new content, relating a search engine ranking algorithm with the creation of new pages.

Item WCL2R : a benchmark collection for Learning to rank research with clickthrough data. (2010) Alcântara, Otávio D. 
A.; Pereira Junior, Álvaro Rodrigues; Almeida, Humberto Mossri de; Gonçalves, Marcos André; Middleton, Christian; Yates, Ricardo Baeza
In this paper we present WCL2R, a benchmark collection for supporting research in learning to rank (L2R) algorithms which exploit clickthrough features. Unlike other L2R benchmark collections, such as LETOR and the recently released Yahoo! collection for an L2R competition, in WCL2R we focus on defining a significant (and new) set of features over clickthrough data extracted from the logs of a real-world search engine. We describe the WCL2R collection by providing details about how the corpora, queries and relevance judgments were obtained, how the learning features were constructed and how the collection was split into folds for representative learning. We also analyze the discriminative power of the WCL2R collection using traditional feature selection algorithms and show that the most discriminative features are, in fact, those based on clickthrough data. We then compare several L2R algorithms on WCL2R, showing that all of them obtain significant gains by exploiting clickthrough information over using traditional ranking approaches.

Item WIM : an information mining model for the web. (2005) Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents a model to mine information in applications involving Web and graph analysis, referred to as the WIM (Web Information Mining) model. We demonstrate the model characteristics using a Web warehouse. The Web data in the warehouse is modeled as a graph, where nodes represent Web pages and edges represent hyperlinks. In the model, objects are always sets of nodes and belong to one class. We have physical objects containing attributes directly obtained from Web pages and links, such as the title of a Web page or the start and end pages of a link. Logical objects can be created by performing predefined operations on any existing object. In this paper we present the model components, propose a set of eleven operators and give examples of views. A view is a sequence of operations on objects, and it represents a way to mine information in the graph. As practical examples, we present views for clustering nodes and for identifying related item sets.