CLUSTERING WEB USAGE TRANSACTIONS FOR EFFICIENT ASSOCIATION RULE MINING (original Turkish title: Verimli Eşleştirme Sorgusu Çıkarımı için Web Günlük Hareketlerinin Gruplandırılması). Mehmet Uluer

ÖZ
CLUSTERING WEB USAGE TRANSACTIONS FOR EFFICIENT ASSOCIATION RULE MINING
Mehmet Uluer
MASTER'S THESIS, DEPARTMENT OF COMPUTER ENGINEERING
Ankara, 2003

The Web grows larger every day and hosts every kind of data source. Although it also takes in many illegitimate sources as it grows, it is the most valuable store of data humankind has built so far. Using data mining techniques to uncover and analyze the behavior patterns of Web users has improved system performance and allowed Internet information services to reach the end user with high quality. It has also helped many organizations measure customer retention, determine cross-selling strategies for the products they market, and estimate the pool of potential customers in electronic commerce. Analyzing site usage plays a critical role in determining effective sales strategies and in optimizing the logical structure of a Web site. Analysis of access patterns also helps advertisements published on the Web reach their target audiences. Web usage mining is the application of data mining techniques to Web logs in order to extract usage patterns. As Web sites grow in size and complexity, the results of Web usage mining become critical for applications such as site design, business and marketing decision support, personalization, usability studies, and network traffic analysis. The two most fundamental problems of Web usage mining are preprocessing the raw data to give an accurate picture of how the site is accessed, and filtering the results of the various data mining algorithms so that only interesting patterns and rules are presented.

Clustering and association rule mining are two important research areas in knowledge discovery in databases, and both have recently attracted considerable attention from the data mining community. This work presents a prototype system that analyzes and preprocesses raw Web log transactions, organizes them on a self-organizing map according to navigation patterns, and extracts association rules separately for each cluster instead of finding rules over all transactions. The self-organizing map clusters the transaction data, where each transaction is the set of pages a user visited during one visit to the site. Association rule mining techniques, in turn, are used to extract the relationships among the pages visited during a visit. Accordingly, the apriori algorithm is applied to the clusters produced by the self-organizing map, so that rules are found separately for each cluster rather than over the whole data set. In this way the system identifies the cluster to which a visitor belongs and, based on the behavior of that cluster, serves the visitor more personalized content.

Keywords: Web Usage Mining, Self-Organizing Maps, Association Rule Mining, Personalization.

ABSTRACT
CLUSTERING WEB USAGE TRANSACTIONS FOR EFFICIENT ASSOCIATION RULE MINING
Mehmet Uluer
MASTER'S THESIS, DEPARTMENT OF COMPUTER ENGINEERING
Ankara, 2003

The World Wide Web is continuously growing and collecting all kinds of resources. Despite the anarchy in which it is growing, the Web is one of the biggest repositories ever built. At the confluence of data mining and World Wide Web technologies, using data mining to explore and analyze regularities in Web user behavior can improve system performance and enhance the quality and delivery of Internet information services to the end user. It can also help organizations determine the lifetime value of their customers, plan cross-marketing strategies across products, and identify populations of potential customers for electronic commerce. Analysis of how users access a site is critical for determining effective marketing strategies and optimizing the logical structure of the Web site. For selling advertisements on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users. Web usage mining is the application of data mining techniques to Web clickstream data in order to extract usage patterns. As Web sites continue to grow in size and complexity, the results of Web usage mining have become critical for a number of applications such as Web site design, business and marketing decision support, personalization, usability studies, and network traffic analysis. The two major challenges in Web usage mining are preprocessing the raw data to provide an accurate picture of how a site is being used, and filtering the results of the various data mining algorithms in order to present only the rules and patterns that are potentially interesting. Clustering and association rule mining are two important research areas of knowledge discovery in databases and have recently received much attention from the data mining community.
In this thesis, we present a prototype system that analyzes and preprocesses raw Web server logs, organizes Web usage transactions on a self-organizing map (SOM) according to user navigation patterns, and keeps track of the association rules for each cluster rather than finding the rules for all transactions. The SOM clusters the whole transaction data set, where each transaction comprises the set of URLs accessed by a client in one visit to the server. Association rule discovery techniques, in turn, are used to discover the correlations between pages. Accordingly, the apriori algorithm is applied to the clusters produced by the SOM to discover association rules on each cluster rather than over the whole data set. Consequently, the system provides a more personalized Web site for the current visitor, based on the usage behavior of the cluster to which the visitor belongs.

Keywords: Web Usage Mining, Self-Organizing Maps, Association Rules, Personalization.
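The per-cluster mining step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the cluster labels are assumed to be given (a stand-in for the SOM output), the sessions, page names, and thresholds are invented for illustration, and the function names `frequent_itemsets`, `rules`, and `rules_per_cluster` are hypothetical. The sketch runs a plain level-wise Apriori search on each cluster separately and derives rules from the frequent itemsets.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    n = len(transactions)
    # Level 1 candidates: every single page seen in the cluster.
    candidates = {frozenset([page]) for t in transactions for page in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent itemsets into k-item candidates and prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = {
            a | b
            for a in level
            for b in level
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Derive rules lhs -> rhs as (lhs, rhs, support, confidence) tuples."""
    found = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[lhs] is always present.
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, support, confidence))
    return found


def rules_per_cluster(transactions, labels, min_support, min_confidence):
    """Mine rules separately inside each cluster of transactions."""
    clusters = defaultdict(list)
    for t, label in zip(transactions, labels):
        clusters[label].append(frozenset(t))
    return {
        label: rules(frequent_itemsets(ts, min_support), min_confidence)
        for label, ts in clusters.items()
    }


# Illustrative data only: each session is the set of URLs from one visit.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/products", "/cart"},
    {"/home", "/docs", "/faq"},
    {"/docs", "/faq"},
    {"/docs", "/faq", "/contact"},
]
labels = [0, 0, 0, 1, 1, 1]  # assumed cluster ids, standing in for SOM output

result = rules_per_cluster(sessions, labels, min_support=0.6, min_confidence=0.9)
for label, cluster_rules in sorted(result.items()):
    for lhs, rhs, support, confidence in cluster_rules:
        print(label, sorted(lhs), "->", sorted(rhs), round(support, 2), round(confidence, 2))
```

In this toy data, a rule such as `/docs -> /faq` holds in every session of cluster 1 but in only half of all six sessions, so it would fall below the 0.6 support threshold if the sessions were mined together. Mining each cluster separately also keeps the candidate sets smaller, since each cluster contains fewer distinct pages.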
