Syntactic similarity of web documents.

dc.contributor.author: Pereira Junior, Álvaro Rodrigues
dc.contributor.author: Ziviani, Nivio
dc.date.accessioned: 2012-10-18T20:49:54Z
dc.date.available: 2012-10-18T20:49:54Z
dc.date.issued: 2003
dc.description.abstract: This paper presents and compares two methods for evaluating the syntactic similarity between documents. Given an original document and a set of candidates, both methods find the candidates that have some similarity relationship with the original. The first method builds a Patricia tree from the original document and computes the similarity by searching the text of each candidate document in the tree. The second method uses the concept of shingles to obtain a similarity measure for every pair of documents: each shingle from the original document is inserted into a hash table, in which the shingles of each candidate document are then searched. Experimental results were obtained with a plagiarized-document generator system, using 900 documents collected from the Web. Measured as the arithmetic average of the absolute differences between the expected and obtained similarity, the shingle-based algorithm obtained 4.13% and the Patricia-tree algorithm obtained 7.50%.
dc.identifier.citation: PEREIRA JUNIOR, A. R.; ZIVIANI, N. Syntactic similarity of web documents. In: Latin American Web Congress, 1., 2003, Santiago. Proceedings... Santiago: Latin American Web Congress, 2003. v. 1, p. 194-200. Available at: <http://www.cwr.cl/la-web/2003/stamped/23_pereira_a.pdf>. Accessed: 18 Oct. 2012.
dc.identifier.uri: http://www.repositorio.ufop.br/handle/123456789/1682
dc.language.iso: en_US
dc.title: Syntactic similarity of web documents.
dc.type: Conference paper
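
The shingle-based method and the error measure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The word-level shingles, the shingle width `w=4`, the definition of similarity as the fraction of candidate shingles found in the original, and all function names are assumptions made for illustration only.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text.

    Word-level shingles and w=4 are illustrative assumptions.
    """
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def similarity(original, candidate, w=4):
    """Fraction of the candidate's shingles that also occur in the original.

    This containment-style definition is an assumption, not taken from the paper.
    """
    table = shingles(original, w)   # a Python set serves as the hash table
    cand = shingles(candidate, w)
    if not cand:
        return 0.0
    hits = sum(1 for s in cand if s in table)
    return hits / len(cand)


def mean_absolute_difference(expected, obtained):
    """Arithmetic average of |expected - obtained| over paired similarity values."""
    if len(expected) != len(obtained) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - o) for e, o in zip(expected, obtained)) / len(expected)
```

For example, `similarity("a b c d e f", "a b c d x y")` compares three candidate shingles against the original and finds one of them, so under these assumptions the candidate shares one third of its shingles with the original.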
Files
Original bundle
Name:
EVENTO_SyntacticSimilarityWeb.pdf
Size:
221.39 KB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission