
There are two directions for future work: the first is to use machine learning techniques or background knowledge to improve the clustering process; the second is to use the obtained hierarchy as input to a post-processing algorithm that improves the information supplied to the user.

1) I plan to investigate much more deeply the relationship between clustering and feature selection by combining the two techniques. Our preliminary results show that clustering (hierarchical clustering in this thesis) can guide the feature selection process and that, conversely, feature selection, used to select a subspace for distance computation, helps the clustering approach obtain very good results. Using background knowledge is also very interesting: pushing into the hierarchy construction some background knowledge, such as constraints defined by the user, can improve the results of the whole process, and this procedure allows us to personalize the structured results to take account of user needs. Fortunately, many useful resources are nowadays available on the web; for this reason, extracting background knowledge and pushing it inside an unsupervised approach seems a promising goal.
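As a toy illustration of how user-defined constraints could be pushed into hierarchy construction, the sketch below honours must-link pairs before any distance-based merge in a single-linkage agglomerative step. All names here (the function, the constraint format) are hypothetical illustrations, not the method this thesis proposes:

```python
# Hypothetical sketch: pushing user background knowledge (must-link
# constraints on pairs of point indices) into an agglomerative step.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def constrained_agglomerative(points, must_link, n_clusters):
    """Single-linkage agglomerative clustering that first honours
    user-supplied must-link constraints (pairs of point indices)."""
    # Start with one singleton cluster per point.
    clusters = [{i} for i in range(len(points))]

    def find(idx):
        for c in clusters:
            if idx in c:
                return c

    # Merge constrained pairs before any distance-based merge.
    for i, j in must_link:
        ci, cj = find(i), find(j)
        if ci is not cj:
            clusters.remove(cj)
            ci |= cj

    # Standard single-linkage merging for the remaining clusters.
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
# The user states that point 4 belongs with the first group.
result = constrained_agglomerative(points, must_link=[(0, 4)], n_clusters=2)
print(result)  # → [[0, 1, 4], [2, 3]]
```

Without the constraint, single linkage would instead isolate point 4 in its own cluster, which shows how even one user-supplied pair can reshape the resulting structure.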

2) From the other point of view, it would be interesting to study a good way to label each cluster, so as to supply the user with an effective means to visualize and understand the data. Moreover, using the output of the proposed approach as input, and applying criteria such as MDL, we can cut the hierarchy to obtain a good flat clustering.
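A very crude stand-in for an information-based cut criterion such as MDL is to cut the dendrogram at the level where the merge distances jump most. The sketch below is only an assumed illustration of the idea of post-processing the hierarchy into a flat clustering, not the criterion this work would ultimately adopt:

```python
# Hypothetical sketch: turning a hierarchy into a flat clustering by
# cutting where the merge distances jump most (a crude stand-in for
# information-based criteria such as MDL).
def cut_by_largest_gap(merge_distances):
    """Given the increasing sequence of distances at which an
    agglomerative hierarchy merged clusters, cut just before the
    largest jump. Returns the number of flat clusters to keep."""
    gaps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    cut_index = max(range(len(gaps)), key=gaps.__getitem__)
    # A hierarchy over n points performs n - 1 merges; keeping the
    # first (cut_index + 1) merges leaves n - (cut_index + 1) clusters.
    n_points = len(merge_distances) + 1
    return n_points - (cut_index + 1)

# Example: four cheap merges, then one expensive merge joining two
# well-separated groups, so the gap suggests keeping 2 clusters.
distances = [0.5, 0.6, 0.7, 0.8, 9.0]
print(cut_by_largest_gap(distances))  # → 2
```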
