Predicting Database Growth with Machine Learning

When managing a small number of databases, a simple database growth chart is an effective tool for capacity planning. However, as an environment grows to hundreds or thousands of databases, chart-based capacity planning becomes very tedious.

Machine learning is an effective tool for automatically calculating growth trends and predicting database growth.

This post is ported from a Python notebook. The original notebook (and data) can be found in my portfolio repository.

Data Preparation

Import common libraries.

import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import cross_val_score, train_test_split
import matplotlib.pyplot as plt

Here we import a CSV file of database size readings. This CSV is an extract from an enterprise monitoring tool and is used for this demonstration only; in production, consider connecting to your enterprise monitoring tool directly.

data = pd.read_csv('data/DiskReads.csv')
data.head()
ReadDate GBUsed
0 4/10/2019 9193
1 4/9/2019 9184
2 4/8/2019 9183
3 4/7/2019 9183
4 4/6/2019 9184
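
If you do connect to a monitoring tool directly, only the load step changes. Below is a minimal sketch assuming a SQL Server metrics repository reachable through SQLAlchemy; the server, database, table, and column names are hypothetical placeholders, not the schema of any particular tool.

# A minimal sketch of loading the same two columns straight from a
# (hypothetical) monitoring database instead of a CSV extract.
from sqlalchemy import create_engine

engine = create_engine(
    'mssql+pyodbc://monitor-server/MetricsDB'
    '?driver=ODBC+Driver+17+for+SQL+Server'
)
query = """
    SELECT ReadDate, GBUsed
    FROM dbo.DiskReadings          -- hypothetical table name
    WHERE ServerName = 'SQLPROD01' -- hypothetical server name
"""
data = pd.read_sql(query, engine)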

Dates can prove problematic for machine learning. An effective approach in this example is to convert a date to an integer value where:

  • Zero is today
  • A negative integer corresponds to the number of days in the past
  • A positive integer corresponds to the number of days in the future

Using the datetime library, we can calculate these integer values.

from datetime import date, datetime, timedelta
# Calculate the delta
data['DaysBack'] = (pd.to_datetime(data['ReadDate']) - datetime.today()).dt.days

data.head()
ReadDate GBUsed DaysBack
0 4/10/2019 9193 -3
1 4/9/2019 9184 -4
2 4/8/2019 9183 -5
3 4/7/2019 9183 -6
4 4/6/2019 9184 -7

Model Fit and Scoring

With the data in an appropriate format, we can fit and score our model.

# Assign the appropriate columns
X = data['DaysBack'].values.reshape(-1,1)
y = data['GBUsed'].values

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit a linear regression model
lm = linear_model.LinearRegression()  
lm.fit(X_train, y_train)

# Score the model on the test data
lm.score(X_test,y_test)
0.9934231470880599
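
The train/test split above is random, so the score will vary slightly from run to run. Since cross_val_score is already imported, a cross-validated score gives a more stable estimate; a minimal sketch (output omitted):

# Average the R^2 score across 5 folds instead of relying on one random split
cv_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5)
print("Mean cross-validated score:", cv_scores.mean())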

A linear regression scores very well on this example. Graphing the data and the model, we can see that this database has very consistent and predictable growth.

plt.plot(X, y, 'o')
plt.plot(X, lm.coef_ * X + lm.intercept_, '-', color='red', label='Trend')
plt.show()

[Plot: database size readings (GBUsed vs. DaysBack) with the fitted linear trend line]

A prediction can be made about when the database will completely fill the drive on the server. The drive in question is 10TB.

Utilizing the equation for a line, y = mx + b, instead of solving for y we will solve for x, where:

  • x corresponds to the day we will hit capacity based on the current growth rate
  • y corresponds to the drive capacity in GB
  • m is the slope of our regression line, provided by the model via lm.coef_
  • b is the intercept of the regression line, also provided by the model via lm.intercept_

# The database drive is 10TB
drive_capacity = 10000

# Instead of solving for y, solve for x
# y = mx + b
# x = (y - b) / m
days_until_capacity = (drive_capacity - lm.intercept_) / lm.coef_

# days_until_capacity is the number of days until we hit drive capacity
print ("Days Until Capacity:", days_until_capacity)

# Perform a time delta to determine the calendar date
capacity_date = date.today() + timedelta(days=days_until_capacity[0])
print ("Reach Capacity On: ", capacity_date)

Days Until Capacity: [138.86209144]
Reach Capacity On:  2019-08-28

We will reach drive capacity in 138 days, which corresponds to August 28th, 2019. We can now plot the data, model, and the predicted day we will reach capacity.

plt.plot(X, y, 'o')
plt.plot(X, lm.coef_ * X + lm.intercept_, '-', color='red', label='Trend')
plt.plot(days_until_capacity, drive_capacity, 'x', color='green')
plt.show()

[Plot: database size readings with the fitted trend line and the predicted capacity point marked]

Conclusion

Using machine learning, we can automatically trend database growth and calculate future resource needs. This approach has several advantages over manually reviewing growth charts, including scalability, efficiency, and consistency.

While this is not a sophisticated use of machine learning, it is an effective method for capacity planning of database infrastructure at scale.

Next steps:

  • Remove the CSV file from the pipeline and connect directly to an enterprise metrics database
  • Loop through an inventory of servers (and drives) and fit a model to each drive
  • Filter the output of this application, showing only high-scoring models with upcoming capacity concerns (a rough sketch of these last two steps follows below)
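
A minimal sketch of that loop-and-filter step, reusing the imports above and assuming a hypothetical get_readings(server, drive) helper that returns a DataFrame with DaysBack and GBUsed columns (the helper, the inventory format, and the thresholds are all placeholders):

def check_capacity(inventory, score_threshold=0.9, days_threshold=180):
    """Fit a model per drive and flag drives nearing capacity."""
    alerts = []
    for server, drive, capacity_gb in inventory:
        df = get_readings(server, drive)   # hypothetical data-access helper
        X = df['DaysBack'].values.reshape(-1, 1)
        y = df['GBUsed'].values
        lm = linear_model.LinearRegression().fit(X, y)
        score = lm.score(X, y)
        if lm.coef_[0] <= 0:               # flat or shrinking; no capacity risk
            continue
        days_left = (capacity_gb - lm.intercept_) / lm.coef_[0]
        # Only surface well-fitting models with near-term capacity concerns
        if score >= score_threshold and days_left <= days_threshold:
            alerts.append((server, drive, round(days_left), round(score, 3)))
    return alerts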
