Song Recommender System
Completed · 2021

ML-based workout song recommender that combines BPM with VADER sentiment analysis. Co-authored research applying K-Means clustering to Billboard Top 100 songs to match them to exercise intensity.

Python · ML · NLP · Sentiment Analysis · K-Means · Research

A machine learning-based music recommendation system that generates adaptive workout playlists from physiological signals and song sentiment. It was built as a co-authored research project with Tanish Maheshwari.

The core insight: BPM alone isn't enough to match a song to an exercise state. Running VADER sentiment analysis on the lyrics shows that the neutral score (the share of the text that VADER rates as neither positive nor negative) has the highest positive covariance with BPM of all the sentiment scores. Combining neutrality with BPM gives a more accurate picture of a song's intensity profile than tempo alone.
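
One way to check which sentiment score moves with tempo is to compute the sample covariance of each VADER score against BPM. The sketch below is illustrative only: the helper names and the five-song values are made up for the example and are not the project's data or code.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def strongest_positive_covariate(features, target):
    """Return (name, covariance) for the feature with the highest covariance with target."""
    covariances = {name: sample_covariance(vals, target) for name, vals in features.items()}
    best = max(covariances, key=covariances.get)
    return best, covariances[best]


# Made-up values for five songs (assumption, NOT the project's data).
bpm = [80, 95, 110, 128, 150]
scores = {
    "pos": [0.30, 0.22, 0.25, 0.18, 0.20],
    "neg": [0.10, 0.15, 0.08, 0.12, 0.05],
    "neu": [0.60, 0.63, 0.67, 0.70, 0.75],
}

best_name, best_cov = strongest_positive_covariate(scores, bpm)
print(best_name, best_cov)
```

Covariance depends on the scale of each feature, so a correlation coefficient may be a fairer comparison when features have very different ranges.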

Lyrics for Billboard Top 100 songs were extracted via the Genius API and scored with VADER. K-Means clustering, with K=4 chosen via the elbow method, then grouped the songs into four workout intensity tiers: warm-up, intensity, aggressive-1, and aggressive-2. The recommender reads live BPM from a smartwatch, maps it to the matching cluster, and plays a randomly chosen song from that tier.
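
The final lookup step could be sketched roughly as follows. The BPM cut-offs, the `LIBRARY` contents, and the function names are hypothetical placeholders; only the four tier names come from the project.

```python
import random
from bisect import bisect_right

TIERS = ["warm-up", "intensity", "aggressive-1", "aggressive-2"]

# Hypothetical BPM cut-offs between adjacent tiers (assumption, not from the project).
BOUNDARIES = [100, 130, 160]

# Placeholder library: tier name -> songs assigned to that cluster.
LIBRARY = {
    "warm-up": ["Song A", "Song B"],
    "intensity": ["Song C", "Song D"],
    "aggressive-1": ["Song E", "Song F"],
    "aggressive-2": ["Song G", "Song H"],
}


def tier_for_bpm(bpm):
    """Map a live BPM reading to a tier; a reading equal to a cut-off goes to the higher tier."""
    return TIERS[bisect_right(BOUNDARIES, bpm)]


def recommend(bpm, library, rng=random):
    """Return (tier, song), with the song drawn uniformly at random from the tier."""
    tier = tier_for_bpm(bpm)
    songs = library.get(tier)
    if not songs:
        raise LookupError(f"no songs available for tier {tier!r}")
    return tier, rng.choice(songs)


tier, song = recommend(145, LIBRARY)
print(f"{tier}: {song}")
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient for testing.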

Key Highlights

  • Co-authored research project with Tanish Maheshwari (Presidency University)
  • Lyrics extracted from the Billboard Top 100 via the Genius API and the lyrics-extractor library
  • VADER sentiment analysis (rule-based NLP) to score each song's lyrics on positive, negative, neutral, and compound dimensions
  • Neutrality identified as the feature with the highest positive covariance with BPM, and used as the primary clustering feature
  • K-Means clustering with K=4 chosen via the WCSS elbow method, giving four clusters: warm-up, intensity, aggressive-1, aggressive-2
  • Live BPM input via smartwatch mapped to cluster range for real-time song selection
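
The VADER scoring step listed above might look like the sketch below. It assumes the `vaderSentiment` package, whose `SentimentIntensityAnalyzer.polarity_scores` returns a dict with `neg`, `neu`, `pos`, and `compound` keys; the source does not say which VADER distribution the project used, and the helper names here are hypothetical.

```python
VADER_KEYS = ("neg", "neu", "pos", "compound")


def make_vader_analyzer():
    """Build a VADER analyzer (assumes `pip install vaderSentiment`)."""
    # Imported lazily so the helpers below can be used with any compatible analyzer.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def score_lyrics(lyrics, analyzer):
    """Return the four VADER scores for one song's lyrics."""
    scores = analyzer.polarity_scores(lyrics)
    return {key: scores[key] for key in VADER_KEYS}


def score_songs(lyrics_by_title, analyzer):
    """Score every song in a {title: lyrics} mapping."""
    return {title: score_lyrics(text, analyzer) for title, text in lyrics_by_title.items()}
```

A typical call would be `score_songs(lyrics_by_title, make_vader_analyzer())`, after which the `neu` value of each song can serve as the neutrality feature.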

Tech Stack

ML

Scikit-learn · K-Means · Hierarchical Clustering

NLP

VADER Sentiment · lyrics-extractor · Genius API

Data

Pandas · NumPy · Matplotlib · Billboard Kaggle dataset

Challenges

  • Noisy heart rate data from consumer wearables affecting cluster boundary accuracy
  • Subjectivity of workout intensity: the same BPM feels different across fitness levels
  • Cold start problem: new users have no preference history to anchor recommendations
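
One common way to dampen noisy wearable readings before mapping them to a cluster is a rolling median. The sketch below is a generic illustration, not something the project is stated to have used; the class name and window size are assumptions.

```python
from collections import deque
from statistics import median


class HeartRateSmoother:
    """Rolling-median filter over the most recent BPM readings."""

    def __init__(self, window=5):
        if window < 1:
            raise ValueError("window must be at least 1")
        # deque with maxlen drops the oldest reading automatically.
        self._readings = deque(maxlen=window)

    def update(self, bpm):
        """Add a reading and return the median of the current window."""
        self._readings.append(bpm)
        return median(self._readings)


smoother = HeartRateSmoother(window=3)
# The 190 spike is a made-up outlier; the median keeps it from dominating the output.
smoothed = [smoother.update(reading) for reading in [120, 122, 190, 121]]
print(smoothed)
```

A larger window suppresses more spikes but reacts more slowly to genuine changes in effort.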

Key Learnings

  • Combining physiological signals with content features produces more robust clusters than either alone
  • VADER's neutrality score covaries more strongly with tempo than its positive or negative scores
  • Validating K with the elbow method is critical: an initial, unvalidated choice of K=5 produced noisier clusters
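
The elbow check can be illustrated without any dependencies. This is a simplified 1-D K-Means on made-up neutrality scores, not the project's code; with scikit-learn one would typically fit `KMeans` for each K and read its `inertia_` attribute instead.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, wcss)."""
    data = sorted(values)
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of values")
    # Deterministic start: k evenly spaced points taken from the sorted data.
    if k == 1:
        centroids = [data[len(data) // 2]]
    else:
        centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in data:
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # WCSS: squared distance from each point to its nearest centroid.
    wcss = sum(min((value - c) ** 2 for c in centroids) for value in data)
    return centroids, wcss


def wcss_curve(values, max_k):
    """WCSS for K = 1..max_k; the elbow is where the curve flattens out."""
    return {k: kmeans_1d(values, k)[1] for k in range(1, max_k + 1)}


# Made-up neutrality scores forming four loose groups (assumption, NOT project data).
neutrality = [0.50, 0.52, 0.51, 0.62, 0.63, 0.61,
              0.74, 0.75, 0.73, 0.88, 0.87, 0.89]

curve = wcss_curve(neutrality, 6)
for k, wcss in curve.items():
    print(k, round(wcss, 5))
```

On data like this the WCSS drops sharply up to K=4 and only marginally afterwards, which is the pattern the elbow method looks for.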