Open Access Journal

ISSN: 2394-2320 (Online)

International Journal of Engineering Research in Computer Science and Engineering (IJERCSE)

Monthly Journal for Computer Science and Engineering


Efficient Document Classification using Phrases Generated by Semi-Supervised Hierarchical Latent Dirichlet Allocation

Authors: Rohit Agrawal¹, A. S. Jalal², S. C. Agarwal³, Himanshu Sharma⁴

Date of Publication: 14th February 2018

Abstract: Many models are available for document classification, such as support vector machines, neural networks, and the Naive Bayes classifier. These models are based on the bag-of-words representation, which does not capture the semantic meaning of words. Word meaning is better represented by word occurrences and by the proximity of words within a particular document. To preserve this proximity, we use a “bag of phrases” model, which can exploit the discriminative power of phrases for document classification. We propose a novel method to extract phrases from the corpus using the well-known topic model Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA), and we integrate these phrases into a vector space model for document classification. Experiments show that classifiers perform efficiently with this bag-of-phrases model, and the results also show that SSHLDA outperforms other related representation models.
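
To make the pipeline concrete, here is a minimal sketch in Python. SSHLDA has no standard off-the-shelf implementation, so scikit-learn's unsupervised LatentDirichletAllocation stands in for the topic model here; the toy corpus, labels, and the top-phrase cutoff are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of a bag-of-phrases classification pipeline, assuming:
# - unigrams/bigrams as candidate phrases,
# - plain LDA as a stand-in for SSHLDA (which selects topic-bearing phrases),
# - a toy corpus with made-up labels.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

docs = [
    "support vector machine for text classification",
    "naive bayes classifier for spam filtering",
    "latent dirichlet allocation topic model for documents",
    "hierarchical topic model with nested chinese restaurant process",
]
labels = [0, 0, 1, 1]  # hypothetical class labels

# Step 1: candidate phrases = unigrams and bigrams ("bag of phrases").
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# Step 2: fit a topic model over the phrase vocabulary. The paper uses
# SSHLDA; unsupervised LDA stands in here as a simplification.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Step 3: keep phrases strongly concentrated in one topic (score near 1),
# mimicking the phrase-selection role the topic model plays in the paper.
phrase_scores = lda.components_.max(axis=0) / lda.components_.sum(axis=0)
keep = np.argsort(phrase_scores)[-50:]  # top-scoring phrases (cutoff assumed)
X_phrases = X[:, keep]

# Step 4: train a standard classifier on the reduced phrase vectors.
clf = LinearSVC().fit(X_phrases, labels)
print(clf.predict(X_phrases))
```

In this sketch the topic model acts purely as a phrase filter: the classifier still operates on ordinary count vectors, but over a vocabulary restricted to phrases the topic model finds coherent, which is the core idea behind the bag-of-phrases representation.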

References:

    1. C. J. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, pp. 121–167, 1998.
    2. B. Dasarathy, “Nearest neighbor (NN) norms: NN pattern classification techniques,” IEEE Computer Society Press, 1991.
    3. T. Joachims, “A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization,” in Proceedings of the 14th International Conference on Machine Learning (ICML), pp. 143–151, 1997.
    4. I. Androutsopoulos, J. Koutsias, and K. V. Chandrinos, “An evaluation of naive Bayesian anti-spam filtering,” arXiv preprint cs/0006013, 2000.
    5. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
    6. D. Wang, M. Thint, and A. Al-Rubaie, “Semi-supervised latent Dirichlet allocation and its application for document classification,” in Proceedings of the IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, vol. 3, pp. 306–310, 2012.
    7. W. Zhang, T. Yoshida, and X. Tang, “Text classification based on multi-word with support vector machine,” Knowledge-Based Systems, vol. 21, pp. 879–886, 2008.
    8. D. Gujraniya and M. N. Murty, “Efficient classification using phrases generated by topic models,” in Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pp. 2331–2334, 2012.
    9. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
    10. T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of Uncertainty in Artificial Intelligence (UAI), 1999.
    11. C. Chemudugunta, P. Smyth, and M. Steyvers, “Text modeling using unsupervised topic models and concept hierarchies,” arXiv preprint arXiv:0808.0973, 2008.
    12. D. Blei and J. Lafferty, “Correlated topic models,” Advances in Neural Information Processing Systems, vol. 18, p. 147, 2006.
    13. D. M. Blei and J. D. McAuliffe, “Supervised topic models,” in Proceedings of Neural Information Processing Systems (NIPS), 2007.
    14. S. Lacoste-Julien, F. Sha, and M. I. Jordan, “DiscLDA: Discriminative learning for dimensionality reduction and classification,” Advances in Neural Information Processing Systems, vol. 21, 2008.
    15. M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, “The author-topic model for authors and documents,” in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494, AUAI Press, 2004.
    16. D. Ramage, C. D. Manning, and S. Dumais, “Partially labeled topic models for interpretable text mining,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 457–465, ACM, 2011.
    17. D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 248–256, Association for Computational Linguistics, 2009.
    18. D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, “Hierarchical topic models and the nested Chinese restaurant process,” Advances in Neural Information Processing Systems, vol. 16, p. 106, 2004.
    19. W. Li and A. McCallum, “Pachinko allocation: DAG-structured mixture models of topic correlations,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 577–584, ACM, 2006.
    20. D. Mimno, W. Li, and A. McCallum, “Mixtures of hierarchical topics with Pachinko allocation,” in Proceedings of the 24th International Conference on Machine Learning, pp. 633–640, ACM, 2007.
    21. Y. Petinot, K. McKeown, and K. Thadani, “A hierarchical model of web summaries,” in Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies (short papers), vol. 2, pp. 670–675, ACL, 2011.
