Data Mining II

This project extends the Data Mining I project and applies advanced data mining techniques across several areas: time series analysis, classification, regression, and model explainability. My team and I worked with two interconnected tabular datasets obtained through the Spotify API, which contain track features such as danceability, energy, and acousticness, and artist metadata such as popularity and follower counts. From these we developed a complete data science pipeline, from data understanding through advanced modeling and interpretation.

The datasets were merged to create an enriched view that combines track-level audio features with aggregated artist statistics. After extensive preprocessing including handling duplicate records with multiple genres, creating derived features like BeatPerMinute, and ensuring data quality through semantic consistency checks, the final dataset contained nearly 90,000 observations with 36 features.

Additionally, we worked with time series data representing 9,860 songs across 20 different genres, applying dimensionality reduction techniques and exploring temporal patterns in music.

This project provided extensive hands-on experience with advanced data mining methodologies and deepened my understanding of the complete machine learning workflow. Working with both tabular and time series data taught me the importance of careful data preparation and the significant impact that preprocessing decisions have on downstream analysis.

I gained practical expertise in time series approximation techniques, learning that Piecewise Aggregate Approximation (PAA) generally outperformed Symbolic Aggregate approXimation (SAX) on our music classification tasks, and understanding when to apply distance measures such as Dynamic Time Warping (DTW) and Derivative Dynamic Time Warping (DDTW). The classification experiments showed that model performance depends heavily on proper hyperparameter tuning and that ensemble methods such as Random Forest and Gradient Boosting consistently delivered the best results, reaching accuracies of around 22% on the challenging multi-class key prediction task.
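To make the approximation and distance ideas above concrete, here is a minimal pure-Python sketch of PAA, DTW, and the derivative variant of DTW. It is not the code used in the project: the function names (paa, dtw_distance, derivative, ddtw_distance) are hypothetical, the local cost is assumed to be squared difference, and no warping window is applied. The derivative estimate follows the commonly cited formulation that averages the slope to the left neighbour with the slope across both neighbours.

```python
import math


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: replace the series with the
    mean of each of n_segments roughly equal-width frames."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    reduced = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        frame = series[start:end]
        reduced.append(sum(frame) / len(frame))
    return reduced


def dtw_distance(a, b):
    """Dynamic Time Warping distance with squared-difference local cost
    and no window constraint (an assumption of this sketch)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return math.sqrt(cost[n][m])


def derivative(series):
    """Estimated first derivative used by derivative DTW; the two
    endpoints copy the value of their nearest interior neighbour."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 points to estimate a derivative")
    inner = [
        ((series[i] - series[i - 1]) + (series[i + 1] - series[i - 1]) / 2) / 2
        for i in range(1, n - 1)
    ]
    return [inner[0]] + inner + [inner[-1]]


def ddtw_distance(a, b):
    """Derivative DTW: run DTW on the estimated derivatives, so shape
    matters more than absolute level."""
    return dtw_distance(derivative(a), derivative(b))
```

In this sketch, DTW absorbs local stretching in time (a repeated value costs nothing), while the derivative variant additionally ignores a constant vertical offset between two series, which is one reason to prefer it when overall shape matters more than level.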

The imbalanced learning section was particularly interesting, teaching me various resampling strategies and their trade-offs. I learned that Random Undersampling, despite its simplicity, can be effective when computational resources are limited. The neural network experiments highlighted the constant challenge of balancing model complexity with overfitting, and the importance of early stopping and regularization techniques.
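As an illustration of why Random Undersampling is cheap, the following is a minimal standard-library sketch, not the project's implementation. The function name random_undersample and its seed parameter are hypothetical, and it assumes every class is reduced to the size of the smallest class.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples from larger classes until every class
    has as many samples as the smallest one."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Target size per class: the minority class count.
    # (Raises ValueError on empty input, since min() has nothing to compare.)
    target = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # preserve the original sample order

    return [rows[i] for i in kept], [labels[i] for i in kept]
```

The trade-off is visible in the code: the cost is a single pass over the labels plus a sample per class, but every majority-class row that is not drawn is discarded, so information is lost in exchange for a smaller, balanced training set.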

Perhaps most valuable was the explainability work using LIME, LORE, and SHAP. These tools demonstrated that even with relatively modest prediction accuracy, we can extract meaningful insights about feature importance and model decision-making.

Finally, the project reinforced the importance of rigorous experimental methodology, including proper train-test splitting, cross-validation, and comprehensive metric reporting. The experience of working through challenges like computational constraints, class imbalance, and high-dimensional data taught me valuable problem-solving skills that extend beyond specific algorithms to general data science practice.