International Journal of Engineering Research in Computer Science and Engineering (IJERCSE)

Community Detection with Semantic based Information Filtering

Authors: Jimsy Johnson ¹, Smitha C S ²

Date of Publication: 7th February 2016

Abstract: Topic modelling is widely used in machine learning and text mining. It was proposed as a way to generate statistical models that classify multiple topics in a collection of documents, with each topic represented by a distribution over words. Many mature term-based and pattern-based approaches have been used in information filtering to derive users' information needs from a collection of documents. A user's interests typically involve multiple topics, and Latent Dirichlet Allocation (LDA) has been used to represent multiple topics in a collection of documents. Polysemy and synonymy are two prominent problems in document modelling. Patterns are now used to represent topics because they have more discriminative power than single words, but the large number of discovered patterns is difficult to process. We therefore seek a more efficient method for optimizing pattern generation and a more accurate model of user interest. We use the Maximum Matched Pattern-based Topic Model. The maximum matched patterns are then passed through an NLP engine that generates synonyms of the patterns, which makes search more efficient. A community for scholars is also created; it is useful for resolving doubts and for announcing events in a particular area.
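The two steps named in the abstract, selecting maximum matched patterns and expanding them with synonyms, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a pattern is a set of words, that a pattern "matches" a document when all of its words occur in that document, and that a matched pattern is "maximum" when no other matched pattern strictly contains it. The function names and the synonym dictionary (a stand-in for the NLP engine) are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Assumption: a pattern is an unordered set of words (an itemset).
Pattern = FrozenSet[str]


def maximum_matched_patterns(
    patterns: Iterable[Pattern], doc_words: Iterable[str]
) -> List[Pattern]:
    """Return the matched patterns that no other matched pattern strictly contains."""
    doc = set(doc_words)
    # A pattern matches when every one of its words occurs in the document.
    matched = {p for p in patterns if p and p <= doc}
    # Keep only patterns that are not proper subsets of another matched pattern.
    maximal = [p for p in matched if not any(p < q for q in matched)]
    # Sort for a deterministic order.
    return sorted(maximal, key=lambda p: sorted(p))


def expand_with_synonyms(pattern: Pattern, synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Add known synonyms of each word; `synonyms` is a hypothetical lookup table."""
    expanded = set(pattern)
    for word in pattern:
        expanded |= synonyms.get(word, set())
    return expanded


if __name__ == "__main__":
    topic_patterns = [
        frozenset({"topic"}),
        frozenset({"topic", "model"}),
        frozenset({"pattern", "mining"}),
        frozenset({"data"}),
    ]
    document = ["topic", "model", "for", "data", "filtering"]
    for pattern in maximum_matched_patterns(topic_patterns, document):
        print(sorted(expand_with_synonyms(pattern, {"data": {"information"}})))
```

In this sketch the shorter pattern {"topic"} is dropped because the longer matched pattern {"topic", "model"} contains it, which is how the number of patterns to process is kept small. A real system would replace the dictionary with a lexical resource or NLP engine.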

References:

1. D. M. Blei, “Probabilistic topic models,” Commun. ACM, vol. 55, no. 4, pp. 77–84, 2012.
2. J. Mostafa, S. Mukhopadhyay, M. Palakal, and W. Lam, “A multilevel approach to intelligent information filtering: Model, system, and evaluation,” ACM Trans. Inform. Syst., vol. 15, no. 4, pp. 368–399, 1997.
3. S. E. Robertson and I. Soboroff, “The TREC 2002 filtering track report,” in Proc. TREC, 2002, vol. 2002, no. 3, p. 5.
4. Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, “Adapting ranking SVM to document retrieval,” in Proc. 29th Annu. Int. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2006, pp. 186–193.
5. F. Beil, M. Ester, and X. Xu, “Frequent term-based text clustering,” in Proc. 8th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2002, pp. 436–442.
6. S.-T. Wu, Y. Li, and Y. Xu, “Deploying approaches for pattern refinement in text mining,” in Proc. 6th Int. Conf. Data Min., 2006, pp. 1157–1161.
7. N. Zhong, Y. Li, and S.-T. Wu, “Effective pattern discovery for text mining,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 1, pp. 30–44, Jan. 2012.
8. J. Lafferty and C. Zhai, “Probabilistic relevance models based on document and query generation,” in Language Modeling for Information Retrieval. New York, NY, USA: Springer, 2003, pp. 1–10.
9. L. Azzopardi, M. Girolami, and C. Van Rijsbergen, “Topic based language models for ad hoc information retrieval,” in Proc. Neural Netw. IEEE Int. Joint Conf., 2004, vol. 4, pp. 3281–3286.
10. S. Robertson, H. Zaragoza, and M. Taylor, “Simple BM25 extension to multiple weighted fields,” in Proc. 13th ACM Int. Conf. Inform. Knowl. Manag., 2004, pp. 42–49.
11. Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, “Adapting ranking SVM to document retrieval,” in Proc. 29th Annu. Int. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2006, pp. 186–193.
12. X. Li and B. Liu, “Learning to classify texts using positive and unlabeled data,” in Proc. Int. Joint Conf. Artif. Intell., 2003, vol. 3, pp. 587–592.
13. J. Fürnkranz, “A study using n-gram features for text categorization,” Austrian Res. Inst. Artif. Intell., vol. 3, no. 1998, pp. 1–10, 1998.
14. W. B. Cavnar and J. M. Trenkle, “N-gram-based text categorization,” Ann Arbor MI, vol. 48113, no. 2, pp. 161–175, 1994.
15. Y. Xu, Y. Li, and G. Shaw, “Reliable representations for association rules,” Data Knowl. Eng., vol. 70, no. 6, pp. 555–575, 2011.
16. J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: Current status and future directions,” Data Min. Knowl. Discov., vol. 15, no. 1, pp. 55–86, 2007.
17. R. J. Bayardo Jr, “Efficiently mining long patterns from databases,” in Proc. ACM Sigmod Record, 1998, vol. 27, no. 2, pp. 85–93.
18. J.-F. Boulicaut, A. Bykowski, and C. Rigotti, “Freesets: A condensed representation of boolean data for the approximation of frequency queries,” Data Min. Knowl. Discov., vol. 7, no. 1, pp. 5–22, 2003.
19. A. Bykowski and C. Rigotti, “Dbc: A condensed representation of frequent patterns for efficient mining,” Inform. Syst., vol. 28, no. 8, pp. 949–977, 2003.
20. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.