WATER POTABILITY PREDICTION WITH MACHINE LEARNING
A machine learning model that predicts whether water is safe to consume
Water is an essential resource for the survival of every living being in the world. According to research, we can survive without food for 3 weeks, but we can't last 3 days without water. Although water is essential for life, not all the water in the world is drinkable; ocean water, for example, is not. If our drinking water contains dissolved salts in high concentrations, it may cause severe side effects or kidney problems, and long-term consumption may even lead to death. Water in different regions has different properties. So let's build an ML program that can predict whether water is potable or not. The dataset contains parameters like pH, hardness, and sulfates, from which our machine can learn to predict whether a water sample is safe to drink.
Okay, let's dive into the project.
To do this project we need Jupyter Notebook; to learn what Jupyter Notebook is and how to install it, check out my previous story.
Open a new file in Jupyter Notebook.
1. IMPORTING THE REQUIRED LIBRARIES
Here we import all the required libraries and modules: pandas, scikit-learn, matplotlib, seaborn, NumPy, and XGBoost.
(I added embedded links to them so you can learn what each one does.)
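The import cell looked roughly like this (a minimal sketch; the exact aliases and modules may differ from the original notebook):

```python
# Data handling and numeric computation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling utilities from scikit-learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

# XGBoost (install with `pip install xgboost` first, then uncomment)
# from xgboost import XGBClassifier
```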
2. IMPORT DATASET
We drop the missing values because water quality is very sensitive data; we cannot tamper with it by imputing the mean, median, or mode.
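In the notebook the data comes from a CSV file; here a tiny stand-in frame (invented values, just for illustration) shows the cleaning step:

```python
import numpy as np
import pandas as pd

# In the real project the data is loaded from a CSV, e.g.:
# df = pd.read_csv("water_potability.csv")  # filename is an assumption
# This small stand-in frame demonstrates the same cleaning step.
df = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Hardness": [204.9, 181.1, np.nan, 224.2],
    "Sulfate": [368.5, 333.8, 310.1, np.nan],
    "Potability": [0, 1, 0, 1],
})

# Drop rows with any missing value instead of imputing mean/median/mode:
# water quality is sensitive, so we avoid fabricating measurements.
clean = df.dropna()
print(clean.shape)  # only the fully observed rows remain
```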
3. DATA PROCESSING - SPLITTING THE DATA INTO TRAIN AND TEST SETS
Before we teach our machine, we need to process the data. We split the data into inputs (X, the properties of the water) and outputs (y, whether the water is consumable or not). These sets are further divided into train and test sets in a 90%-10% proportion.
We also have to scale our data for certain model evaluations.
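The split and scaling steps can be sketched as follows (synthetic data stands in for the water features; the split ratio matches the 90%-10% described above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the water data: X = water properties, y = potability label
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# 90%-10% train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Scale features: fit the scaler on the train set only,
# then apply the same transformation to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```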
4. MODEL EVALUATION
To improve the accuracy, I tried different types of classifiers.
HYPERPARAMETER TUNING
Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm before the learning process begins.
To do so, let's assign the classifiers as follows.
For parameter tuning, we use the grid search cross-validation function from the scikit-learn module on each classifier and define the models.
Best parameters
Now we have our best parameters.
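A sketch of the grid-search step, using two of the classifier types with small illustrative parameter grids (the actual classifiers and grids in the notebook may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data for the water features and potability labels
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

# Candidate classifiers with a hyperparameter grid each (grids are assumptions)
models = {
    "decision_tree": (DecisionTreeClassifier(random_state=0),
                      {"max_depth": [3, 5, 10], "criterion": ["gini", "entropy"]}),
    "knn": (KNeighborsClassifier(),
            {"n_neighbors": [3, 5, 7]}),
}

best_params = {}
for name, (clf, grid) in models.items():
    # 5-fold grid-search cross-validation over the parameter grid
    search = GridSearchCV(clf, grid, cv=5)
    search.fit(X, y)
    best_params[name] = search.best_params_

print(best_params)
```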
BAGGING AND BOOSTING
Bagging and boosting are ensemble learning methods for improving the model predictions of a given learning algorithm.
Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. It is primarily used to improve performance on tasks like classification, prediction, and function approximation.
We apply bagging to the decision tree classifier for testing.
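The bagging step can be sketched like this; scikit-learn's `BaggingClassifier` uses a decision tree as its base learner by default, matching the setup described above (the synthetic data and estimator count are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: X = water properties, y = potability (0/1)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Bagging trains many copies of the base learner (a decision tree
# by default) on bootstrap samples and combines their votes
bagged = BaggingClassifier(n_estimators=50, random_state=0)
bagged.fit(X_train, y_train)
print(bagged.score(X_test, y_test))  # held-out accuracy
```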
5. TRAINING AND TESTING
We have all our models ready. Now it's time to teach the machine and test its accuracy.
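The train-then-test cycle for a single model looks roughly like this (shown with a decision tree on stand-in data; the same pattern applies to every model above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data for the water features and potability labels
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)            # teach the machine on the train set
preds = model.predict(X_test)          # predict potability for unseen samples
print(accuracy_score(y_test, preds))   # fraction of correct predictions
```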
6. TESTING THE ACCURACY
Out of all of them, the XGBoost classifier has the best performance.
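A sketch of the final comparison loop. The article uses XGBoost; here scikit-learn's `GradientBoostingClassifier` stands in for the boosting model so the snippet runs without extra packages, and the data is synthetic, so the winner here need not match the article's result:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data for the water features and potability labels
X, y = make_classification(n_samples=400, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its held-out accuracy
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}

# Rank the models and report the best one
best = max(scores, key=scores.get)
print(best, scores[best])
```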
I hope you all understood the concepts; it may look complex, but it's not. If a machine can test the potability of water, it could even test water on Mars and check whether it can be used for agricultural or industrial purposes. Who knows what the future will surprise us with?