Weather Forecasting - Application of ML

soham Medewar
Last Updated: May 13, 2022


Traditional weather forecasting uses complex physics models with many atmospheric conditions over a long period. However, these models are volatile due to the disturbance of the weather system, which leads to inaccurate predictions. 

This article will explain how the weather is predicted using Machine Learning concepts. Machine Learning is a technique to prepare a model using a training dataset. A model is a formula to predict the output using weights and variables for each training variable.

So let's get started. 


  • Basic knowledge of pandas, numpy, matplotlib, seaborn, statsmodels, tqdm_notebook, product.
  • Must be familiar with ARIMA, SARIMAX concept.


Essential libraries

import pandas as pd
import seaborn as sns

import numpy as np 
import matplotlib.pyplot as plt
import statsmodels.api as sm
from itertools import product
from tqdm import tqdm_notebook
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, pacf, acf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from import plot_pacf
from import plot_acf
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.stats.diagnostic import acorr_ljungbox

These are some essential libraries that will help in preparing the model.

Dataset Exploration

I will use the GlobalLandTemperaturesByMajorCity dataset available on Kaggle. 

This is the first important step in making the model.

Read the dataset using pandas. 

data = pd.read_csv("GlobalLandTemperaturesByMajorCity.csv")

Below is the dataset structure and first five entries of the dataset.

Let us just check the total number of distinct cities in the dataset. Select any one city for forecasting purposes.

all_cities = data.drop_duplicates(['City'])

We can see that we have a total of 100 distinct cities in our dataset.

Preprocessing, Advanced Visualization, Stationary Check

In this part, we will process the raw data, extract some helpful information, and modify it accordingly. 

As we saw, there are 100 different cities in our dataset. You can select any one of them. In the below model, I will perform forecasting on Ahmadabad city. 

Let us extract all the data of Ahmadabad city.

adi_data = data[data['City'] == "Ahmadabad"]


We are storing all the data of Ahmadabad city in the ‘adi_data’ variable. Below is the output of the first five entries of the dataset and the total size of the dataset.

In the preprocessing part, we must check for the NaN values in the dataset. If there are a lot of NaN values in the dataset, we must replace every NaN value with some standard methods, or else we can drop the values.

exp = (0.10.1)
        labels=['Non NaN elements','NaN elements'], 

We can see that a tiny portion of the dataset has NaN values, so we will drop all those NaN values.

After removing the NaN values from the dataset, let us check the size of the remaining dataset.

adi_data = adi_data.dropna()

There is a sufficient amount of data left to perform forecasting.

Now reset the index of the dataset. If you don't use the index column, you can drop it too, as I found no use here, so I will drop the index column. Change the “dt” column with proper date-time format using the “to_datetime” function.

adi_data = adi_data.reset_index()
adi_data = adi_data.drop(columns = ['index'])
adi_data.dt = pd.to_datetime(adi_data.dt)

This is a modified version of our dataset.

For easier use, let us make separate columns for year, month, weekday, and day.

year = []
month = []
weekday = []
day = []
forin range(len(adi_data)):
adi_data['year'] = year
adi_data['month'] = month
adi_data['weekday'] = weekday
adi_data['day'] = day

Dataset looks like this after making specific changes.

Let us look over the data types in which our data is stored.


Now we will see scatter plot for average temperature values in the data set.

avg_tmp_lst = adi_data['AverageTemperature'].tolist()
year_lst = adi_data['year'].tolist()
month_lst = adi_data['month'].tolist()

The below output shows that the scatter plot has many overlapped values, so it's challenging to analyze the dataset.

Hence, will check if there are any patterns in the dataset by plotting the average temperature for every decade.

I am defining a time series function; it takes input as starting and ending year and gives the output a list of all the average temperatures and their time from starting to ending year.

Using the time series function and matplotlib we will plot the average temperature of four decades starting from 1896 and ending at 1936.

def get_timeseries(start, end):
    time_series_tmp_lst = []
    time_series_yr_lst = []
    forin range(len(avg_tmp_lst)):
        if year_lst[i] >= start and year_lst[i] <= end:
    return [time_series_yr_lst, time_series_tmp_lst]

plt.title("Graph shows the average temperature of 4 decades",fontsize=25,color='maroon')
plt.ylabel("Average Temperature",fontsize=15,color='maroon')
x = get_timeseries(1896,1906)
x = get_timeseries(1906,1916)
x = get_timeseries(1916,1926)
x = get_timeseries(1926,1936)

The below graph shows the average temperature plot for four decades. Each color represents a decade. From this graph, we can understand that it follows a particular pattern every year.

While using ARIMA models, we must ensure that our time series data is stationary. To check whether the selected time-series data is stationary, we must plot autocorrelation and partial autocorrelation graphs.

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig =, ax=ax1,color ='firebrick')
ax2 = fig.add_subplot(212)
fig =, ax=ax2,color='firebrick')

The above graph suggests that the time series data is not stationary. Nonetheless, performing the AD Fuller Test on the entire dataset tells us that the dataset is stationary.

result = adfuller(adi_data.AverageTemperature)
print('ADF Statistic on the first decade: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('Critical Values:')
for key, value in result[4].items():
    print('\t{}: {}'.format(key, value))

But it is true just because we are looking at the entire dataset. In fact, if we analyze a single decade, it is clear that the dataset is absolutely not stationary for the decade period of time.

result = adfuller(adi_data.AverageTemperature[0:10])
print('ADF Statistic on the first decade: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('Critical Values:')
for key, value in result[4].items():
    print('\t{}: {}'.format(key, value))

Machine Learning

For preparing the model, I will consider the time series from 1992 to 2013.

plt.title("Graph shows the average temperature (1992-2013) ",fontsize=25,color='maroon')
plt.ylabel("Average Temperature",fontsize=15,color='maroon')
x = get_timeseries(1992,2013)

Here’s is the plot of average temperature versus year.

Let us split the dataset (from 1992 - 2013) into training and testing data.

dataset = get_timeseries(19922013)
N = len(dataset[0])
split = 0.95
training_size = round(split*N)
test_size = round((1-split)*N)
series = dataset[1][:training_size]
date = dataset[0][:training_size]
test_series = dataset[1][len(date)-1:len(dataset[0])]
test_date = dataset[0][len(date)-1:len(dataset[0])]

Plotting the train and test data.

fig, ax = plt.subplots(figsize=(10,10))
train = ax.plot(date,series, label = "Training Set")
test = ax.plot(test_date,test_series, label = "Test Set")
ax.legend(bbox_to_anchor=(0.1.021..102), loc='lower left', ncol=2, mode="expand", borderaxespad=0.)

Here comes the main algorithm called ARIMA model. ARIMA is also called as autoregressive integrated moving average this model is based on optimization procedure that adopts maximum likelihood function.

def optimize_ARIMA(order_list, exog):
        Return dataframe with parameters and corresponding AIC
        order_list - list with (p, d, q) tuples
        exog - the exogenous variable
    results = []
    seasonal_order = (0000)
    for order in tqdm_notebook(order_list):
            model = SARIMAX(exog, order=order, seasonal_order=seasonal_order).fit(disp=0)
            aic = model.aic
            results.append([order, model.aic])
    result_df = pd.DataFrame(results)
    result_df.columns = ['(p, d, q)''AIC']
    result_df = result_df.sort_values(by='AIC', ascending=True).reset_index(drop=True)
    return result_df

First we will perform operation for zero-differentiated ARIMA models. In the code below, d = 0 is declared for zero-differentiated ARIMA model.

ps = range(0101)
d = 0
qs = range(0101)

# Create a list with all possible combination of parameters
params = product(ps, qs)
paras_list = list(params)

order_list = []

for each in params_list:
    each = list(each)
    each.insert(1, d)
    each = tuple(each)
result_d_0 = optimize_ARIMA(order_list, exog = series)


The result for the zero-differentiated ARIMA model.

Now lets perform the same operation for d = 1 indicates as first-differentiated ARIMA models.

ps = range(0101)
d = 1
qs = range(0101)

# Create a list with all possible combination of parameters
params = product(ps, qs)
paras_list = list(params)

order_list = []

for each in params_list:
    each = list(each)
    each.insert(1, d)
    each = tuple(each)
result_d_1 = optimize_ARIMA(order_list, exog = series)

The result for the first-differentiated ARIMA model.

So let us combine the results from both the operations (zero and first differentiated models). Pick the best model from the combined results.

final_result = result_d_0.append(result_d_1)
bestmodel = final_result.sort_values(by='AIC', ascending=True).reset_index(drop=True).head()
best_model_params_0 = bestmodel[bestmodel.columns[0]][0]
best_model_params_1 = bestmodel[bestmodel.columns[0]][1]


best_model_0 = SARIMAX(series, order=best_model_params_0).fit()
best_model_1 = SARIMAX(series, order=best_model_params_1).fit()

Till now, we have successfully created the model using ARIMA.


Here comes the final part, now we will train the dataset using the prepared model.

fore_l= test_size-1
forecast = best_model_0.get_prediction(start=training_size, end=training_size+fore_l)
forec = forecast.predicted_mean
ci = forecast.conf_int(alpha=0.05)

s_forecast = best_model_1.get_prediction(start=training_size, end=training_size+fore_l)
s_forec = s_forecast.predicted_mean
s_ci = forecast.conf_int(alpha=0.05)


Let us plot our results.

fig, ax = plt.subplots(figsize=(15,8), dpi=300)
x0 = adi_data.AverageTemperature.index[0:training_size]

plt.plot(x0, adi_data.AverageTemperature[0:training_size],'k', label = 'Average Temperature')
plt.plot(adi_data.AverageTemperature[training_size:training_size+fore_l], '.k', label = 'Actual')

forec = pd.DataFrame(forec, columns=['f'], index = x1)

s_forec = pd.DataFrame(s_forec, columns=['f'], index = x1)
s_forec.f.plot(ax=ax,color = 'firebrick',label = 'Forecast  (2,1,6) model')

plt.legend(loc = 'lower left')
plt.xlabel('Index Datapoint')
plt.ylabel('Average Temperature')

The below plot shows our forecast; the black line and dotted lines show the actual temperature, the red line shows the predicted temperature.


  1. What is SARIMAX?
    Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors is an updated version of the ARIMA model.
  2. What is the general use of ARIMA?
    It's a model used in statistics and econometrics to measure events that happen over a period of time.
  3. What are time series models?
    Time series models are used to forecast future events based on previous events that have been observed (and data collected) at regular time intervals. Time series analysis is a useful business forecasting technique.

Key Takeaways

In this blog, we covered: 

  • Importance of Machine Learning in weather forecasting.
  • Use of ARIMA and SERIMAX.
  • Exploring, preprocessing, visualizing dataset.


Hello readers, here's a perfect course that will guide you to dive deep into Machine learning.

Happy Coding!

Was this article helpful ?


No comments yet

Be the first to share what you think