Machine learning to predict day zero
A project completed as part of the EECS 349 Machine Learning course to raise awareness of the rapid depletion of water resources by predicting Day Zero for a country. Day Zero refers to a situation of acute water shortage. The first stage of the project involved building a dataset of 1968 examples with 11 attributes, shown in the table below, including rainwater harvesting awareness, water consumption per capita, and desalination capacity. The dataset covered 180 countries from 1960 to 2014 and was sourced from the AQUASTAT database. We then trained several machine learning models on this dataset with the water stress level as the target attribute, and used the best-performing model to predict the stress level of each country.
| Attribute | Unit | Description |
|---|---|---|
| Rainwater Harvesting Awareness | Yes/No | Determined by whether or not rainwater harvesting is widely practiced |
| Water Consumption per Capita | m^3/year/inhabitant | Total amount of water withdrawn per capita |
| Desalination Capacity | km^3/year | Fresh water produced using brackish or salt water |
| Water Dependency Ratio | % | Percentage of water that comes from other countries |
| Agricultural Water Withdrawal | % | Percentage of total water withdrawn used for agriculture |
| Industrial Water Withdrawal | % | Percentage of total water withdrawn used for industrial purposes |
| Municipal Water Withdrawal | % | Percentage of total water withdrawn used for municipal purposes |
| Water Stress Level | % | Water stress level measured by dividing total water withdrawal by the total water available minus any water needed for environmental flow. This was used to determine the class label for each sample |
| Total Land Cultivated | % | Percentage of the total land area of the country that has been cultivated |
| Annual Precipitation | mm/yr | Total depth of precipitation per year |
| Total Renewable Water Resources per Capita | m^3/year/inhabitant | The maximum theoretical yearly amount of water available per person for a country at a given moment |
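The water stress level described in the table can be sketched in a few lines of Python. This is a minimal illustration of the stated formula (total withdrawal divided by available water minus environmental flow), not the code used in the project. The function names and the class cut-offs are hypothetical; the actual thresholds used to assign class labels are not given here.

```python
# Hypothetical cut-offs (upper bound in %, label); the project's real
# thresholds for the class labels are not stated, so these are placeholders.
HYPOTHETICAL_CUTOFFS = ((25.0, "low"), (70.0, "medium"))


def water_stress_level(total_withdrawal, total_available, environmental_flow):
    """Return water stress as a percentage.

    All three arguments must be in the same unit (for example km^3/year).
    """
    usable = total_available - environmental_flow
    if usable <= 0:
        raise ValueError("available water minus environmental flow must be positive")
    return 100.0 * total_withdrawal / usable


def stress_class(stress_pct, cutoffs=HYPOTHETICAL_CUTOFFS):
    """Map a stress percentage to a class label using the given cut-offs."""
    for upper_bound, label in cutoffs:
        if stress_pct < upper_bound:
            return label
    return "high"


# Example with made-up numbers: 30 units withdrawn, 120 available, 20 reserved.
example_stress = water_stress_level(30.0, 120.0, 20.0)
example_label = stress_class(example_stress)
```

Keeping the cut-offs as a parameter makes it easy to swap in whatever thresholds a given study defines.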
Figure: Weka algorithm vs. classification accuracy
Using the dataset, we used Weka to compare algorithms and choose one for building a model of our data. The classification accuracy of each model is shown in the graph above. Nearest neighbor (IBk) produced the best results, with 88.26% classification accuracy. Using the model built with the nearest neighbor algorithm, we predicted stress levels for all the countries. Then, with scikit-learn, we fit a linear regression to the data to predict when each country’s water stress level would cross a critical level. This project was completed by Aamir Husain (MSR ’18) and me.
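The linear regression step can be illustrated with a short scikit-learn sketch. It fits a straight line to one country's stress levels over time and solves for the year at which the line reaches a threshold. The function name, the sample values, and the critical level of 100% are assumptions for illustration; the project's actual features and threshold are not specified here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def predict_crossing_year(years, stress_levels, critical_level):
    """Estimate the year a linear stress trend reaches critical_level.

    Returns None when the fitted trend is flat or decreasing, since the
    line then never crosses the threshold from below.
    """
    X = np.asarray(years, dtype=float).reshape(-1, 1)  # shape (n_samples, 1)
    y = np.asarray(stress_levels, dtype=float)
    model = LinearRegression().fit(X, y)
    slope = model.coef_[0]
    if slope <= 0:
        return None
    # Solve critical_level = slope * year + intercept for year.
    return (critical_level - model.intercept_) / slope


# Made-up data: stress rises 2 percentage points per year.
crossing = predict_crossing_year([2000, 2005, 2010], [40.0, 50.0, 60.0], 100.0)
```

A straight-line fit is the simplest possible trend model; it assumes the rate of change stays constant, so the estimated year is best read as a rough indicator.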
Try out our interactive website here! The final project report can be downloaded here.