Linear Fitting
Problem: Find a and b in the equation y=a+b×x, given the vectors x and y, using the least-squares method. Also, determine the corresponding R-squared value.
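In other words, the least-squares method chooses the parameters that minimize the sum of squared residuals (a standard formulation, stated here just for reference):

$$\min_{a,\,b}\ \sum_{i=1}^{n}\bigl(y_i-(a+b\,x_i)\bigr)^2$$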
Test data:
x←⍳25
y←10+5×x
]plot y x
This data is too perfect. Let’s add some noise:
y+←(5-⍨10×?0⍴⍨≢y) ⍝ add uniform noise in (¯5,5); ?0 rolls a random float in (0,1)
]plot y x
The slope is easy to determine if we already know the value of a:
⊢b←(y-10)⌹x
4.900895018
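For a single regressor with a known intercept, the matrix divide reduces to the familiar closed-form slope (standard least-squares algebra, spelled out here as an aside):

$$b=\frac{\sum_i x_i\,(y_i-a)}{\sum_i x_i^{2}}$$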
We can also use ⌹ to fit several variables. To fit the value of a, we consider an additional variable with a constant value of 1, as explained in the documentation for ⌹. We just need to modify our right argument from
x
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
into
1,⍪x
1  1
1  2
1  3
1  4
1  5
1  6
1  7
1  8
1  9
1 10
1 11
1 12
1 13
1 14
1 15
1 16
1 17
1 18
1 19
1 20
1 21
1 22
1 23
1 24
1 25
(a vector is treated by ⌹ as a column vector). Then:
⊢(a b)←y⌹1,⍪x
8.968943213 4.961545417
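In matrix terms, writing X for the design matrix 1,⍪x, the expression y⌹X solves the least-squares normal equations (standard notation, added here for context):

$$\begin{pmatrix}a\\b\end{pmatrix}=(X^{\mathsf T}X)^{-1}X^{\mathsf T}y,\qquad X=\begin{pmatrix}1&x_1\\\vdots&\vdots\\1&x_n\end{pmatrix}$$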
Fitting function
Let’s write a function to perform linear fittings. Our function will take the y x data as right argument and will be ambivalent (we use 0=⎕NC'⍺' to check whether a left argument was passed to the function). If there is a left argument, it will be interpreted as the value of the ordinate at the origin, so we will fit the slope of the line through the origin using the shifted data ⍵-⍺ 0. If no left argument is given, both a and b will be fitted. In both cases, the function will return the values a b.
LF←{0=⎕NC'⍺': ⊃⌹∘(1,⍪)/⍵ ⋄ ⍺,⌹/⍵-⍺ 0} ⍝ linear fit. ⍵: y x data. ⍺ (optional): ordinate
⍝ eg
⊢(ad bd)←10 LF y x
⊢(a b )← LF y x
]plot ((ad+bd×x) (a+b×x) y) x
10 4.900895018
8.968943213 4.961545417
R-squared value
We still need to determine the R-squared value, which can be calculated as one minus the ratio of the sums of squares (÷⍥(+/×⍨)) of the differences between the real and estimated values and between the real values and their mean:
1-(y-a+b×x)÷⍥(+/×⍨)(⊢-+⌿÷≢)y ⍝ R-squared value
0.9971435494
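Written out, this is the usual definition of the coefficient of determination, where ŷᵢ = a + b·xᵢ are the estimated values and ȳ is the mean of y:

$$R^{2}=1-\frac{\sum_i (y_i-\hat y_i)^{2}}{\sum_i (y_i-\bar y)^{2}}$$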
The R-squared value is defined not only for linear fittings, but also for any other fitting we perform on some data. Therefore, we can define an operator that takes the fitted function as operand (in our case, {a+b×⍵} or a+b∘×) and the fitting data as arguments.
_R2←{1-(⍺-⍺⍺⍵)÷⍥(+/×⍨)(⊢-+/÷≢)⍺} ⍝ returns R-squared value for function ⍺⍺. ⍺ ⍵ is the y x data
⍝ eg
y (a+b∘×)_R2 x
0.9971435494
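The dfn form of the operand mentioned above works the same way and returns the same value (a quick check, assuming a and b still hold the values fitted earlier):

y {a+b×⍵}_R2 x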
More generally, we can define a dyadic operator _R2_ that takes the fitting and fitted functions as operands and returns the fitted parameters and the R-squared value for the given data (right argument) and optional parameters for the fitting function (left argument). We then use this operator to define an LR2 function that performs a linear fitting and returns the fitted parameters and the R-squared value.
_R2_←{⍺←⊢ ⋄ r←⍺ ⍺⍺ ⍵ ⋄ r,r∘⍵⍵_R2/⍵} ⍝ returns fitted parameters and R-squared value
⍝ ⍺⍺: fitting function, ⍵⍵: fitted function
⍝ ⍵: fitting data, ⍺ (optional): left argument of fitting function
LR2←LF _R2_{(a b)←⍺ ⋄ (a+b∘×)⍵} ⍝ linear fitting with y x data ⍵ and (optional) ordinate ⍺
⍝ eg
⊢ad bd _←10 LR2 y x
⊢az bz _← 0 LR2 y x
⊢a b _← LR2 y x
]plot ((ad+bd×x) (az+bz×x) (a+b×x) y) x
10 4.900895018 0.996948701
0 5.489130312 0.9823995937
8.968943213 4.961545417 0.9971435494
Obviously, the R-squared value is higher when we fit both values, since the fitted function can accommodate the data better. However, it is useful to have an optional left argument to impose already-known values, as is often the case for functions that are known to pass through the origin.
Summary
_R2←{1-(⍺-⍺⍺⍵)÷⍥(+/×⍨)(⊢-+/÷≢)⍺} ⍝ R-squared value for ⍺⍺ fitted function
_R2_←{⍺←⊢ ⋄ r←⍺ ⍺⍺ ⍵ ⋄ r,r∘⍵⍵_R2/⍵} ⍝ fit and R-squared value for fitting and fitted functions ⍺⍺ and ⍵⍵
LF←{0=⎕NC'⍺': ⊃⌹∘(1,⍪)/⍵ ⋄ ⍺,⌹/⍵-⍺ 0} ⍝ linear fitting function (⍵: y x data, ⍺: optional ordinate)
LR2←LF _R2_{(a b)←⍺ ⋄ (a+b∘×)⍵} ⍝ linear fitting and R-squared value (⍵: y x data, ⍺: optional ordinate)