This repository contains a full machine learning pipeline for the Zillow Houses competition on the Kaggle platform.
The pipeline consists of 10 general steps:
- Exploratory Data Analysis (univariate, bivariate, hypothesis testing, confidence intervals)
- Missing values (imputation strategies from simple to advanced: the MICE algorithm using gradient boosting, LightGBM, etc.)
- Duplicate checking
- Advanced Anomaly Detection (models such as KNN and Isolation Forests, plus a final detector, SUOD, which aggregates the results of the base models)
- Multicollinearity problem solving
- Feature Engineering
- Feature Transformation of some features, backed by hypothesis testing (fitting distributions and checking them with statistical tests)
- Feature Selection, advanced and basic: Recursive Feature Elimination with cross-validation on different tree-based models (Gradient Boosting, Random Forests, etc.), Lasso with L1 regularization, and tree feature importances, all combined into one algorithm which takes all of the above methods into account
- Modeling (different regression models, fine-tuning, learning curves, validation curves, residual analysis, etc.). Later, I want to try some stacking strategies on boosted trees and some NN models
- Results analysis: best model selection using confidence intervals and different non-parametric statistical tests, etc.
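
The feature selection step combines several selectors into one decision. A minimal sketch of one way to do that is a majority vote over the feature lists each method returns. The function name `vote_features`, the `min_votes` threshold, and the feature names are all hypothetical and only illustrate the idea; this is not necessarily the aggregation rule used in this repository.

```python
from collections import Counter


def vote_features(selections, min_votes=2):
    """Keep features chosen by at least `min_votes` selection methods.

    `selections` maps a method name to the features that method kept.
    (Assumed interface for illustration only.)
    """
    votes = Counter()
    for selected in selections.values():
        # set() ensures each method votes at most once per feature
        votes.update(set(selected))
    # Sort so the result is deterministic
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical outputs of three selectors (made-up feature names)
selections = {
    "rfecv": ["area", "year_built", "bathrooms"],
    "lasso": ["area", "bathrooms", "tax_value"],
    "tree_importance": ["area", "year_built", "latitude"],
}

selected = vote_features(selections, min_votes=2)
print(selected)
```

Raising `min_votes` makes the selection stricter: with `min_votes=3` only features that every method agrees on survive.
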

This solution also contains a custom preprocessing pipeline which can automatically perform steps 2-8 ( all in ? )
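
As a rough illustration of how such a pipeline can chain steps, here is a minimal sketch in plain Python, assuming each step is a function that takes and returns a list of row dictionaries. The names `run_pipeline`, `drop_duplicates`, and `impute_median`, as well as the sample rows, are hypothetical. Median imputation is used here only as a simple stand-in for the MICE-based imputation listed above; this is not the repository's actual implementation.

```python
from statistics import median


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows):
    """Fill None values with the column median (stand-in for MICE)."""
    columns = {col for row in rows for col in row}
    fill = {}
    for col in columns:
        observed = [row[col] for row in rows if row.get(col) is not None]
        fill[col] = median(observed) if observed else None
    return [
        {col: (val if val is not None else fill[col]) for col, val in row.items()}
        for row in rows
    ]


def run_pipeline(rows, steps):
    """Apply each preprocessing step in order."""
    for step in steps:
        rows = step(rows)
    return rows


# Made-up sample data: one duplicate row and one missing value
rows = [
    {"area": 100.0, "rooms": 3},
    {"area": 100.0, "rooms": 3},
    {"area": None, "rooms": 4},
    {"area": 140.0, "rooms": 5},
]

clean = run_pipeline(rows, [drop_duplicates, impute_median])
print(clean)
```

The order of the steps matters: duplicates are dropped first so that repeated rows do not skew the statistics used for imputation.
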