WATER POTABILITY PREDICTION WITH MACHINE LEARNING

A machine learning model predicts whether water is safe to consume

Semparudhi
4 min read · Jun 19, 2021
Photo by Pixa Karma on Unsplash

Water is an essential resource for the survival of every living being in the world. According to research, we can survive without food for about three weeks, but we cannot last three days without water. Although water is essential for life, not all of the world's water is drinkable; ocean water, for example, is not. If our drinking water contains dissolved salts in high concentrations, it may cause severe side effects and kidney problems, or even lead to death if consumed long term. Water in different regions has different properties. So let's build an ML program that can predict whether water is potable or not. The dataset contains parameters such as pH, hardness, and sulfates, which can be used to teach our machine to predict whether a water sample is safe to drink.

Okay, let's dive into the project

For this project we need Jupyter Notebook. To learn what Jupyter Notebook is and how to install it, check out my previous story.

Open a new file in Jupyter Notebook.

1. IMPORTING THE REQUIRED LIBRARIES

Here we import all the required libraries and modules: pandas, scikit-learn, matplotlib, seaborn, NumPy, and XGBoost.
(I added embedded links to them so you can learn what each one does.)
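The import cell might look like the following (a sketch; the exact list depends on the steps below, and xgboost is a separate install):

```python
# Data handling and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn utilities and classifiers used in this project
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              BaggingClassifier)
from sklearn.metrics import accuracy_score

# xgboost is installed separately (pip install xgboost), so guard the import
try:
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None
```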

2. IMPORTING THE DATASET

Here, our dataset is a CSV file containing the properties of water samples.

We drop the missing values because water quality is very sensitive data; we cannot tamper with it by imputing the mean, median, or mode.
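In code, loading and cleaning might look like this (a sketch; the tiny DataFrame below stands in for the real CSV, whose file name I am assuming to be `water_potability.csv`):

```python
import numpy as np
import pandas as pd

# In the project: dataset = pd.read_csv("water_potability.csv")
# Here, a tiny stand-in frame with the same kind of columns:
dataset = pd.DataFrame({
    "ph":         [7.1, np.nan, 6.8, 8.2],
    "Hardness":   [204.9, 181.1, np.nan, 224.2],
    "Sulfate":    [333.8, 310.0, 356.9, np.nan],
    "Potability": [1, 0, 0, 1],
})

# Water quality is sensitive data, so drop incomplete rows instead of imputing
dataset = dataset.dropna()
print(dataset.shape)  # only the fully observed rows remain
```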

This is a graphical representation of the dataset.
This shows the correlation of the features.
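The correlation view can be reproduced with seaborn (a sketch; `dataset` below is a tiny stand-in for the cleaned water data, and the figure is saved to a file rather than shown):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import seaborn as sns

# Stand-in for the cleaned dataset
dataset = pd.DataFrame({
    "ph":       [7.1, 6.8, 8.2, 7.5],
    "Hardness": [204.9, 181.1, 224.2, 190.4],
    "Sulfate":  [333.8, 356.9, 310.0, 329.5],
})

# Pairwise correlation between the water-quality features
corr = dataset.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation of water-quality features")
plt.savefig("correlation.png")
```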

3. DATA PROCESSING: SPLITTING THE DATA INTO TRAIN AND TEST SETS

Before we teach our machine, we need to process the data. We split the data into inputs (X, the properties of the water) and outputs (y, whether the water is consumable or not). These are then further divided into train and test sets in a 90%/10% proportion. This is our data structure.

We also have to scale our data for certain model evaluations.
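A sketch of the split-and-scale step (synthetic numbers stand in for the real samples; the column names follow the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in for the cleaned dataset: three properties plus the label
dataset = pd.DataFrame(rng.normal(size=(100, 3)),
                       columns=["ph", "Hardness", "Sulfate"])
dataset["Potability"] = rng.integers(0, 2, size=100)

# Inputs X: the water properties; output y: potable or not
X = dataset.drop("Potability", axis=1)
y = dataset["Potability"]

# 90% train / 10% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (90, 3) (10, 3)
```

Fitting the scaler on the training data only avoids leaking information from the test set into the model.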

4.MODEL EVALUATION

To improve the accuracy, I tried different types of classifiers.

HYPERPARAMETER TUNING

Hyperparameter tuning is a process in which a set of optimal hyperparameters for a learning algorithm is chosen before the learning process begins.

To do so, let's assign the classifiers as follows.

For parameter tuning, we use the grid search cross-validation function (GridSearchCV) from the scikit-learn module on each classifier and define its parameter grid.

i) KNeighbors Classifier

ii) Decision Tree Classifier

iii) Random Forest Classifier

iv) AdaBoost Classifier

v) XGBoost Classifier

Best parameters

Now we have our best parameters for each classifier.
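For one of the classifiers, the grid search step might look like this (a sketch with synthetic data; the parameter grid is an assumption, and each classifier above gets its own grid):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data
X_train, y_train = make_classification(n_samples=200, n_features=5,
                                       random_state=0)

# Candidate hyperparameter values for the KNeighbors classifier
param_grid = {"n_neighbors": [3, 5, 7, 9],
              "weights": ["uniform", "distance"]}

# 5-fold cross-validated grid search over all candidate combinations
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)  # the "best parameters" for this classifier
print(grid.best_score_)   # its cross-validated accuracy
```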

BAGGING AND BOOSTING

Bagging and boosting are ensemble learning methods for improving the model predictions of any given learning algorithm.

Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. It is primarily used to improve a model's performance on tasks like classification, prediction, and function approximation.

We apply bagging to the Decision Tree classifier for testing.
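A minimal sketch of bagging a decision tree (synthetic data again; `BaggingClassifier` trains many trees on bootstrap samples of the training set and lets them vote):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# Bagging: 50 decision trees, each fit on a bootstrap sample, majority vote
bagged_tree = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=50, random_state=0)
bagged_tree.fit(X_train, y_train)

acc = accuracy_score(y_test, bagged_tree.predict(X_test))
print(acc)
```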

5. TRAINING AND TESTING

We have all our models ready. Now it's time to teach the machine and test its accuracy.

6. TESTING THE ACCURACY

Out of all the classifiers, the XGBoost classifier has the best performance.
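The fit-and-score loop might look like the sketch below (synthetic data; I leave out XGBClassifier here since xgboost is a separate install, but it would be added to the dictionary the same way):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

models = {
    "KNeighbors":   KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost":     AdaBoostClassifier(random_state=0),
}

# Fit each model on the training set and score it on the held-out test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, results[name])
```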

I hope you all understood the concepts; it may look complex, but it's not. If a machine can test the potability of water, it could even test water on Mars and check whether it can be used for agricultural or industrial purposes. Who knows what the future will surprise us with?
