This article will introduce data science by presenting an essential method: linear regression.
It’s a method used when two continuous numeric variables correlate. Typical examples of correlated data are the size of a flat and its price, the amount of time spent studying and test scores, the number of years at work and the salary, and the sales of a product and the amount spent on advertising.
So we have two kinds of values, one of which helps predict the other.
The standard terms are dependent variable for the value we’re trying to predict,
and independent variable for the value that helps us predict it.
When thinking about mathematical functions, x is the independent variable; the function returns y, which is the dependent variable.
Creating a linear regression between salary and seniority, we understand that seniority is the independent variable and will help us predict the salary (the dependent variable).
The linear regression equation is y = mx + b.
- y is the dependent variable.
- x is the independent variable.
- m is the slope of the line.
- b is the y-intercept (where the line crosses the y-axis).
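To make the equation concrete, here is a tiny Ruby sketch for the salary example. The slope and intercept below are invented numbers, purely to show how the formula is applied, not estimates from any real data:

slope = 2000.0 # made-up: each extra year of seniority adds 2,000 to the predicted salary
intercept = 30000.0 # made-up: the predicted salary at zero years of seniority
years = 5 # x, the independent variable
salary = slope * years + intercept # y = mx + b => 40000.0
puts salary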
Calculating the slope is the tricky bit.
The slope measures how much the dependent variable y is expected to change for a one-unit change in the independent variable x.
A positive slope indicates a positive relationship,
while a negative slope indicates a negative one.
The calculation involves comparing each data point to the average values
and finding the average rate of change across all points.
The steps are:
- Calculate the means of x and y
- Calculate the differences for each point: (x - mean of x) and (y - mean of y)
- Calculate the product of the differences for each point: multiply the two values found in the previous step
- Sum the products: add up all the products calculated in the previous step
- Calculate the squared differences for each point: (x - mean of x) ** 2
- Sum the squared differences: add up all the squared differences from the previous step
- Divide the sum of the products by the sum of the squared differences
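To make these steps concrete before touching the real dataset, here is a plain-Ruby sketch of the calculation on two small arrays of made-up numbers (the sample values are mine, purely for illustration):

# Two small, made-up samples to walk through the seven steps above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
# Step 1: the means of x and y
mean_x = xs.sum / xs.size
mean_y = ys.sum / ys.size
# Steps 2, 3 and 4: differences from the means, multiplied pairwise, then summed
sum_of_products = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
# Steps 5 and 6: squared differences of x, summed
sum_of_squares = xs.sum { |x| (x - mean_x) ** 2 }
# Step 7: the slope is the ratio of the two sums
slope = sum_of_products / sum_of_squares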
Let’s get to code. I have a dataset of irises (the flower) in a CSV. The iris dataset is one of the classic datasets for introducing data science. With it, I will do a linear regression to predict the width of a petal from its length.
I will use the gem polars-df. It’s a library that brings dataframes to Ruby.
Dataframes are now a staple of data science.
The best-known library is pandas, used in Python.
Polars is a newer dataframe library, written in Rust, and generally faster than pandas.
A dataframe is essentially a spreadsheet data structure.
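To make that concrete, here is a tiny dataframe built by hand; the columns and numbers are my own toy example, not part of any real dataset:

require "polars-df" # load the gem (use the require that matches your setup)

# Each key is a column name, each array holds that column's values.
toy_df = Polars::DataFrame.new({
  'seniority' => [2, 5, 9],
  'salary' => [34000, 40000, 48000]
})
puts toy_df

With that in mind, let’s load the iris CSV: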
iris_df = Polars.read_csv("iris.csv") # load the CSV into a dataframe
puts iris_df.describe # print summary statistics for every column
and this prints:
We can already see the power of that library as it tells us quite a bit about this dataset.
Let’s build our linear regression:
mean_petal_length = iris_df['petal_length'].mean # step 1: the mean of the petal length values
mean_petal_width = iris_df['petal_width'].mean # step 1: the mean of the petal width values
# Here I'm doing steps 2, 3 and 4: the differences from the means, their products, and the sum of those products
numerator = ((iris_df['petal_length'] - mean_petal_length) * (iris_df['petal_width'] - mean_petal_width)).sum
# Here I'm doing steps 5 and 6: the squared differences of the lengths, summed
denominator = ((iris_df['petal_length'] - mean_petal_length) ** 2).sum
# And this is step 7: dividing the sum of the products by the sum of the squared differences
slope = numerator.to_f / denominator.to_f
Now that I have the slope, I can compute the intercept:
intercept = mean_petal_width - (slope * mean_petal_length)
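With the slope and the intercept in hand, a prediction is just the line equation again. For example (the 4.5 cm petal length below is an arbitrary value I picked):

sample_length = 4.5 # an arbitrary petal length, in the dataset's unit (cm)
predicted_width = slope * sample_length + intercept # y = mx + b
puts "Predicted petal width: #{predicted_width.round(2)} cm"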
To check how it all worked out, I spun up a Sinatra app where I created a series of points representing the regression line (which will be a straight line) and compared it with all the petals.
I created the line of points like this:
@x_values = (iris_df['petal_length'].min..iris_df['petal_length'].max).step(0.1).to_a # x values every 0.1 across the observed range of lengths
@y_values = @x_values.map { |x| slope * x + intercept } # the corresponding y values on the fitted line
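For reference, here is a minimal sketch of what such a Sinatra app could look like. The route, the instance variable names and the :chart view are assumptions of mine, not the actual app behind this article:

# A minimal Sinatra sketch (assumed structure, not the original app).
require "sinatra"
require "polars-df"

iris_df = Polars.read_csv("iris.csv")

# Recompute the slope and intercept exactly as above
mean_petal_length = iris_df['petal_length'].mean
mean_petal_width = iris_df['petal_width'].mean
numerator = ((iris_df['petal_length'] - mean_petal_length) * (iris_df['petal_width'] - mean_petal_width)).sum
denominator = ((iris_df['petal_length'] - mean_petal_length) ** 2).sum
slope = numerator.to_f / denominator.to_f
intercept = mean_petal_width - (slope * mean_petal_length)

get '/' do
  # The raw observations, for the scatter chart
  @petal_lengths = iris_df['petal_length'].to_a
  @petal_widths = iris_df['petal_width'].to_a
  # Points along the fitted line, for the comparison chart
  @x_values = (iris_df['petal_length'].min..iris_df['petal_length'].max).step(0.1).to_a
  @y_values = @x_values.map { |x| slope * x + intercept }
  erb :chart # a hypothetical view that draws both series
end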
And these are the two charts:
We can see that the line approximates the points quite nicely. So now, if I know the length of an iris petal, I can predict its width.
Two things to point out before closing:
- Linear regressions do not work for every pair of variables.
- When they do, we can calculate a margin of error.
That’s something I will present in a follow-up article.