This project focuses on predicting the happiness score of different countries using Machine Learning techniques, exploratory data analysis (EDA), feature selection, real-time data streaming with Kafka, and storage in PostgreSQL.
- Project Description
- Prerequisites
- Project Structure
- Environment Setup
- Data Preparation
- Model Training
- Kafka and PostgreSQL Implementation
- Kafka Producer and Consumer Execution
- Evidence of Results
- Conclusion
This project predicts the happiness index of different countries from features such as GDP per capita, healthy life expectancy, and personal freedom. To do this:
- We process happiness data from different years.
- We apply feature selection techniques.
- We train a regression model.
- We use Apache Kafka to transmit the predictions in real time.
- We store the results in PostgreSQL.
- Python 3.8+
- Docker Desktop for container deployment.
- Kafka and PostgreSQL.
- Jupyter Notebook for EDA and model training.
- Install the following Python packages:
pip install pandas scikit-learn sqlalchemy kafka-python python-dotenv
All dependencies used are listed in the requirements.txt file.
The structure of this project is as follows:
data/: Contains the data files at different stages of processing.
- raw/: Original data for each year.
- processed/: Processed data.
- clean/: Cleaned data ready for analysis.
models/: Contains the files related to the trained prediction model.
- final_happiness_model.pkl: Trained model that predicts the happiness score.
src/: Contains the source code for data processing and streaming.
- kafka_producer.py: Kafka producer that sends predictions through Kafka.
- kafka_consumer.py: Kafka consumer that stores predictions in PostgreSQL.
notebooks/: Contains the Jupyter notebooks used for analysis and model training.
- eda.ipynb: Exploratory analysis of data from 2015 to 2019.
- model_training.ipynb: Training of the regression model that predicts the happiness score.
In the project root:
- README.md: This file, containing the project documentation.
- docker-compose.yml: Configuration of the services for the Kafka broker and ZooKeeper.
- requirements.txt: Dependencies required to run the project (includes libraries such as pandas, scikit-learn, and kafka-python, among others).
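Create a virtual environment:
python -m venv venv
Activate it (Git Bash on Windows):
source venv/Scripts/activate
On Linux or macOS, the activation script is at venv/bin/activate instead.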
Create a .env file in the project root for the PostgreSQL database credentials:
DB_HOST=localhost
DB_NAME=Happiness
DB_USER=postgres
DB_PASSWORD=root
DATABASE_URL=postgresql://user:password@localhost/database_name
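As a rough illustration of how these settings might be consumed, the sketch below builds a PostgreSQL connection URL from the variables above. The helper name `build_database_url` is hypothetical and not part of the project; it assumes `DATABASE_URL` takes precedence when present and that the default port is used.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Return a PostgreSQL URL built from the .env settings (hypothetical helper)."""
    # Assumption: an explicit DATABASE_URL wins over the individual DB_* values.
    url = env.get("DATABASE_URL")
    if url:
        return url
    # Otherwise assemble the URL; quote the credentials so special characters are safe.
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "localhost")
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}/{name}"


example = {"DB_HOST": "localhost", "DB_NAME": "Happiness",
           "DB_USER": "postgres", "DB_PASSWORD": "root"}
print(build_database_url(example))  # postgresql://postgres:root@localhost/Happiness
```

In a real script, `load_dotenv()` from python-dotenv would typically be called first so that the .env values are available in `os.environ`, and the resulting URL could then be passed to `sqlalchemy.create_engine`.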
The original data is in data/raw. We standardize country names and remove null values.
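The cleaning step can be pictured with a minimal pandas sketch. The column names, the alias table, and the toy values below are assumptions made for illustration; the actual notebooks may use different mappings and columns.

```python
import pandas as pd

# Hypothetical alias table; the real notebooks may standardize other names.
COUNTRY_ALIASES = {
    "Trinidad & Tobago": "Trinidad and Tobago",
    "Hong Kong S.A.R., China": "Hong Kong",
}


def clean_happiness(df, country_col="Country"):
    """Standardize country names and drop rows that contain null values."""
    out = df.copy()
    # Trim stray whitespace, then map known variants to one canonical name.
    out[country_col] = out[country_col].str.strip().replace(COUNTRY_ALIASES)
    # Remove any row with a missing value and renumber the index.
    return out.dropna().reset_index(drop=True)


# Toy input with invented scores, only to exercise the function.
raw = pd.DataFrame({
    "Country": [" Finland", "Trinidad & Tobago", "Norway"],
    "Score": [7.6, 6.2, None],
})
clean = clean_happiness(raw)
print(clean)
```

Here the row with the missing score is dropped and the two remaining country names are normalized.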
We open the following notebooks in order and run the cells to perform the cleaning process and save the clean CSV files:
- notebooks/EDA_2015.ipynb
- notebooks/EDA_2016.ipynb
- notebooks/EDA_2017.ipynb
- notebooks/EDA_2018.ipynb
- notebooks/EDA_2019.ipynb
After running them, you should have the clean files in the data/clean folder. Next, we run the following notebook, which adds the region column to the 2017, 2018, and 2019 datasets so that all years can later be concatenated:
notebooks/merge.ipynb
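One way the merge step could work is sketched below: a year that already has a region column serves as a lookup for the years that lack it, and the frames are then concatenated. The function name `add_region`, the column names, and the toy values are assumptions; the notebook itself may take a different approach.

```python
import pandas as pd


def add_region(df, reference, country_col="Country", region_col="Region"):
    """Attach a region to each row by looking the country up in `reference`."""
    # Build a country -> region lookup from a dataset that already has regions.
    lookup = reference.drop_duplicates(country_col).set_index(country_col)[region_col]
    out = df.copy()
    out[region_col] = out[country_col].map(lookup)
    return out


# Toy frames with invented scores; only the shape of the data matters here.
df_2015 = pd.DataFrame({"Country": ["Finland", "Chile"],
                        "Region": ["Western Europe", "Latin America and Caribbean"],
                        "Score": [7.4, 6.7], "Year": 2015})
df_2017 = pd.DataFrame({"Country": ["Chile", "Finland"],
                        "Score": [6.6, 7.5], "Year": 2017})

# Concatenate once every year has the same set of columns.
combined = pd.concat([df_2015, add_region(df_2017, df_2015)], ignore_index=True)
print(combined)
```

Countries that are absent from the reference year would end up with a missing region, so a check such as `combined["Region"].isna().sum()` is worth running after the merge.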
Open the notebook notebooks/model_training.ipynb and run the cells to:
- Perform feature selection.
- Train a regression model to predict the happiness score.
- Evaluate the model and check that it reaches a satisfactory R² (at least 0.80).
- Save the trained model to models/final_happiness_model.pkl.
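The steps above can be sketched with scikit-learn as follows. This is not the notebook's actual code: the data is synthetic, and the choice of `SelectKBest` with `LinearRegression` and of `pickle` for serialization are assumptions made to keep the example small.

```python
import pickle

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 6 features, of which only the first 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 5.0 + 1.2 * X[:, 0] + 0.8 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature selection and regression chained so both are fitted on training data only.
model = make_pipeline(SelectKBest(score_func=f_regression, k=3), LinearRegression())
model.fit(X_train, y_train)

# Evaluate on held-out data before deciding whether to keep the model.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R2 on held-out data: {r2:.3f}")

# Serialize only when the R2 threshold from the README is met; the notebook
# would write these bytes to models/final_happiness_model.pkl instead.
if r2 >= 0.80:
    model_bytes = pickle.dumps(model)
```

Wrapping selection and regression in one pipeline means the saved object can be applied directly to rows with the full feature set at prediction time.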
Running the Container
Docker Desktop must be running. From a Git Bash terminal in the project directory, run:
docker-compose up -d
This command starts the containers. We can verify that they are running with:
docker ps
Now we run the following command to create a topic called happiness_predictions:
docker exec -it happiness-countries-kafka-1 kafka-topics.sh \
--create \
--topic happiness_predictions \
--bootstrap-server localhost:9092 \
--partitions 1 \
--replication-factor 1
Now we make sure that the topic was created with this command:
docker exec -it happiness-countries-kafka-1 kafka-topics.sh \
--list \
--bootstrap-server localhost:9092
We run the producer script to send predictions to the happiness_predictions topic:
python src/kafka_producer.py
In another terminal, we run the consumer to read the messages and save them in PostgreSQL:
python src/kafka_consumer.py
Verify the predictions in PostgreSQL:
SELECT * FROM happiness_predictions;
This project provides a complete solution for predicting and storing happiness scores at a global level, integrating Machine Learning with real-time streaming. The architecture is scalable and allows continuous analysis as updated happiness data arrives.
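For reference, the producer and consumer steps described earlier might look roughly like the sketch below, written with kafka-python. The record fields and the function names are hypothetical, and the actual scripts in src/ may be organized differently. The Kafka imports are done inside the functions so that the serialization helpers can be tried without a running broker.

```python
import json

TOPIC = "happiness_predictions"
BOOTSTRAP_SERVERS = "localhost:9092"


def encode_prediction(record):
    """Serialize one prediction record to UTF-8 JSON bytes for Kafka."""
    return json.dumps(record).encode("utf-8")


def decode_prediction(payload):
    """Inverse of encode_prediction, used on the consumer side."""
    return json.loads(payload.decode("utf-8"))


def send_predictions(records):
    """Publish each record to the topic (requires a running broker)."""
    from kafka import KafkaProducer  # lazy import: only needed when sending

    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS,
                             value_serializer=encode_prediction)
    for record in records:
        producer.send(TOPIC, value=record)
    producer.flush()  # block until all buffered messages are delivered
    producer.close()


def consume_predictions(handle):
    """Call handle(record) for every message read from the topic."""
    from kafka import KafkaConsumer  # lazy import: only needed when consuming

    consumer = KafkaConsumer(TOPIC,
                             bootstrap_servers=BOOTSTRAP_SERVERS,
                             auto_offset_reset="earliest",
                             value_deserializer=decode_prediction)
    for message in consumer:
        handle(message.value)


# Round trip that runs without Kafka; the field names are hypothetical.
sample = {"country": "Finland", "predicted_score": 7.5}
assert decode_prediction(encode_prediction(sample)) == sample
```

In this sketch, the `handle` callback is where the consumer would insert each record into the happiness_predictions table in PostgreSQL.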
William Alejandro Botero Florez