Authors: Tejasmani, Guhan, Raja S
Date of Publication: 17th September 2025
Abstract: This paper presents a supervised natural language processing approach for detecting a user's geographic region and implied interests from social media text, specifically YouTube comments from India and China. Using a dataset of 10,000 region-labeled comments, we implement a DistilBERT-based classifier, enhanced with data augmentation to address class imbalance and noisy, code-mixed inputs. Our model achieves a test accuracy of 91.2%, with recall above 85% for both regions. The inferred region and interests can support personalized content recommendations and help address cold-start challenges in recommendation systems. The study contributes an effective pipeline for region-aware user profiling in multilingual, noisy social media environments.
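For readers less familiar with the reported metrics, the following minimal sketch shows how overall accuracy and per-region recall are conventionally computed from true and predicted labels. It is not the authors' evaluation code: the function name `accuracy_and_recall` and the five toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy_and_recall(y_true, y_pred):
    """Return overall accuracy and a dict mapping each true class to its recall."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter()  # number of true examples per class
    correct = Counter()  # number of correctly predicted examples per class
    for truth, pred in zip(y_true, y_pred):
        support[truth] += 1
        if truth == pred:
            correct[truth] += 1

    accuracy = sum(correct.values()) / len(y_true)
    # Recall for a class = correctly predicted examples of it / all true examples of it
    recall = {label: correct[label] / support[label] for label in support}
    return accuracy, recall


# Toy labels, purely illustrative (NOT the paper's data)
y_true = ["India", "India", "India", "China", "China"]
y_pred = ["India", "India", "China", "China", "China"]

acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy = {acc:.3f}")
for region, value in rec.items():
    print(f"recall[{region}] = {value:.3f}")
```

Reporting recall per region alongside accuracy matters under class imbalance, because a classifier can reach high accuracy while performing poorly on the minority class.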