Browsing by Author "Ziviani, Nivio"
Now showing 1–8 of 8
Item: Genealogical trees on the web: a search engine user perspective (2008)
Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents an extensive study of the evolution of textual content on the Web, showing how some new pages are created from scratch while others are created from already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of the Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study these components in detail, characterizing the copies and identifying the relation between a source of content and a search engine by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are returned by queries more frequently, and clicked more often, than other documents.

Item: Geração de impressão digital para recuperação de documentos similares na web (2004)
Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents a mechanism for generating the “fingerprint” of a Web document. This mechanism is part of a system for detecting and retrieving documents from the Web that have a similarity relation to a suspicious document. The process is composed of three stages: a) generation of a fingerprint of the suspicious document, b) gathering candidate documents from the Web, and c) comparison of each candidate document against the suspicious document. In the first stage, the fingerprint of the suspicious document is used as its identification. The fingerprint is composed of representative sentences of the document. In the second stage, the sentences composing the fingerprint are used as queries submitted to a search engine. The documents identified by the URLs returned by the search engine are collected to form a set of similarity candidate documents.
In the third stage, the candidate documents are compared “in place” against the suspicious document. The focus of this work is the generation of the fingerprint of the suspicious document. Experiments were performed using a collection of plagiarized documents constructed specifically for this work. For the best fingerprint evaluated, on average 87.06% of the source documents used in the composition of the plagiarized document were retrieved from the Web.

Item: In search of a stochastic model for the E-News Reader (2019)
Veloso, Bráulio Miranda; Assunção, Renato Martins; Ferreira, Anderson Almeida; Ziviani, Nivio
E-news readers have an increasingly broad set of news articles at their disposal. Online newspaper sites use recommender systems to predict and offer relevant articles to their users. Typically, these recommender systems do not leverage users' reading behavior. Knowing how the topics read change over a reading session may lead to finer-tuned recommendations: for example, after a user reads a certain number of sports items, it may be counterproductive to keep recommending other sports news. The motivation for this article is the assumption that understanding user behavior when reading successive online news articles can help in developing better recommender systems. We propose five categories of stochastic models to describe this behavior, depending on how the previous reading history affects the future choices of topics. We instantiated these five classes with many different stochastic processes covering short-term memory, revealed-preference, cumulative-advantage, and geometric sojourn models. Our empirical study is based on large datasets of e-news from two online newspapers. We collected data from more than 13 million users who generated more than 23 million reading sessions, each one composed of the user's successive clicks on the posted news. We reduce each user session to its sequence of news topics read.
The models were fitted and compared using the Akaike Information Criterion and the Brier score. We found that the best models are those in which the user moves through topics influenced only by their most recent readings. Our models were also better at predicting the next reading than the recommender systems currently used by these newspapers, showing that our models can improve user satisfaction.

Item: Um novo retrato da web brasileira (2005)
Modesto, Marco; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio; Castilho, Carlos; Yates, Ricardo Baeza
The objective of this article is to evaluate quantitative and qualitative characteristics of the Brazilian Web, comparing current estimates with estimates obtained five years ago. A large part of Web content is dynamic and volatile, which makes collecting it in its entirety infeasible. Hence, the evaluation was performed on a sample of the Brazilian Web, crawled in March 2005. The results are estimated consistently, using an effective methodology already applied in similar studies of the Webs of other countries. Among the main aspects observed in this work are the distribution of page languages, the use of open versus proprietary tools for generating dynamic pages, the distribution of document formats, the distribution of domain types, and the distribution of links to external Web sites.

Item: s-WIM: a scalable web information mining tool (2012)
Melo, Felipe Santiago Martins Coimbra de; Pereira Junior, Álvaro Rodrigues; Lima, Joubert de Castro; Souza, Fabrício Benevenuto de; Ziviani, Nivio
Web mining can be seen as the process of finding patterns on the Web through data mining techniques. Web mining is a computationally intensive task, and most mining software is developed in isolation, which makes scalability and reuse in other mining tasks difficult.
Web mining is an iterative process in which prototyping plays an essential role, allowing experimentation with different alternatives as well as the incorporation of knowledge acquired in previous iterations of the process. Web Information Mining (WIM) is a model for rapid prototyping in Web mining. The main motivation for developing WIM was that its conceptual model provides users with a level of abstraction appropriate for prototyping and experimentation during the mining task. WIM is composed of a data model and an algebra. The WIM data model is a relational view of Web data. The three types of data existing on the Web, namely content, structure, and usage data, are represented as relations. The main inputs of the WIM data model are Web pages, the hyperlink structure interconnecting those pages, and query logs obtained from Web search engines. WIM programming is based on dataflows, in which sequences of operations are applied to the relations. The operations are defined by the WIM algebra, which contains operators for data manipulation and for data mining. WIM materializes a declarative programming language provided by its algebra. The objective of the present work is the software design and development of Scalable Web Information Mining (s-WIM), based on the data model and algebra of WIM. To give the operators, and consequently the programs built from them, the desired scalability, s-WIM was developed on top of the Apache Hadoop and Apache HBase platforms, which provide linear scalability of both data storage and data processing as hardware is added.
The main motivation for developing s-WIM is the lack of free tools that offer both the level of abstraction provided by the WIM algebra and the scalability needed to operate on large databases. Moreover, the level of abstraction provided by the WIM algebra allows users without advanced knowledge of programming languages such as Java or C++ to use it as well. This work presents the design and architecture of s-WIM on top of Hadoop and HBase, along with implementation details of the more complex operators. Several experiments and their results are also presented, demonstrating the scalability of s-WIM and, consequently, its support for mining large volumes of data.

Item: Syntactic similarity of web documents (2003)
Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents and compares two methods for evaluating the syntactic similarity between documents. The first method uses a Patricia tree constructed from the original document; the similarity is computed by searching the text of each candidate document in the tree. The second method uses the shingles concept to obtain a similarity measure for every document pair; each shingle from the original document is inserted in a hash table, in which the shingles of each candidate document are then searched. Given an original document and a set of candidates, the two methods find documents that have some similarity relationship with the original document. Experimental results were obtained using a plagiarized-document generator system, from 900 documents collected from the Web.
Considering the arithmetic average of the absolute differences between the expected and obtained similarity, the algorithm that uses shingles achieved a performance of 4.13% and the algorithm that uses the Patricia tree a performance of 7.50%.

Item: The evolution of web content and search engines (2006)
Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
The Web grows at a fast pace and little is known about how new content is generated. The objective of this paper is to study the dynamics of content evolution in the Web, giving answers to questions like: How much new content has evolved from the Web's old content? How much of the Web's content is biased by the ranking algorithms of search engines? We used four snapshots of the Chilean Web containing documents of all the Chilean primary domains, crawled in four distinct periods of time. If a page in a newer snapshot has content of a page in an older snapshot, we say that the source is a parent of the new page. Our hypothesis is that, for a portion of the pages with parents, there was a query that related the parents and made possible the creation of the new page. Thus, part of the Web's content is biased by the ranking function of search engines. We also define a genealogical tree for the Web, in which many pages are new and have no parents, while others have one or more parents. We present the Chilean Web genealogical tree and study its components. To the best of our knowledge, this is the first paper that studies how old content is used to create new content, relating a search engine ranking algorithm with the creation of new pages.

Item: WIM: an information mining model for the web (2005)
Yates, Ricardo Baeza; Pereira Junior, Álvaro Rodrigues; Ziviani, Nivio
This paper presents a model to mine information in applications involving Web and graph analysis, referred to as the WIM (Web Information Mining) model.
We demonstrate the model's characteristics using a Web warehouse. The Web data in the warehouse is modeled as a graph, where nodes represent Web pages and edges represent hyperlinks. In the model, objects are always sets of nodes and belong to one class. Physical objects contain attributes obtained directly from Web pages and links, such as the title of a Web page or the start and end pages of a link. Logical objects can be created by performing predefined operations on any existing object. In this paper we present the model's components, propose a set of eleven operators, and give examples of views. A view is a sequence of operations on objects, and it represents a way to mine information in the graph. As practical examples, we present views for clustering nodes and for identifying related item sets.
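The idea shared by the WIM and s-WIM abstracts above, a "view" built as a sequence of set-valued operations over a graph of pages, can be sketched as follows. This is a minimal illustration, not the papers' actual algebra: the toy graph, the page titles, and the operator names (`select`, `expand`, `top_by_in_degree`) are all hypothetical stand-ins for WIM's eleven operators.

```python
# A WIM-style "view": a pipeline of operations where every intermediate
# result is a set of nodes, as in the model described above.

# Toy web graph: page -> set of pages it links to (hypothetical data).
GRAPH = {
    "a": {"b", "c"},
    "b": {"c"},
    "c": {"a"},
    "d": {"c"},
}

# A "physical" attribute obtained directly from the pages: their titles.
TITLES = {"a": "home", "b": "news", "c": "about", "d": "blog"}

def select(nodes, predicate):
    """Keep the nodes whose attributes satisfy a predicate."""
    return {n for n in nodes if predicate(n)}

def expand(nodes):
    """Follow the outgoing hyperlinks of every node in the set."""
    out = set()
    for n in nodes:
        out |= GRAPH.get(n, set())
    return out

def in_degree(node):
    """Number of pages linking to this node."""
    return sum(1 for links in GRAPH.values() if node in links)

def top_by_in_degree(nodes, k):
    """Keep the k most linked-to nodes of the set."""
    return set(sorted(nodes, key=in_degree, reverse=True)[:k])

# The view itself: start from pages whose title contains "o",
# follow their links, and keep the single most linked-to result.
view = top_by_in_degree(
    expand(select(set(GRAPH), lambda n: "o" in TITLES[n])), 1
)
print(view)  # -> {'c'}
```

Each operator consumes and produces a set of nodes, so operators compose freely into longer dataflows; this closure property is what makes a declarative, algebra-style mining language possible.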