Skip to content
Snippets Groups Projects
Commit 85ed7d84 authored by Keyser, Johannes's avatar Keyser, Johannes
Browse files

Some more cleanup, removed redundant check.

parent 50373e6a
No related branches found
No related tags found
No related merge requests found
......@@ -14,17 +14,16 @@ The resulting estimates of slope and intercept are relatively insensitive to out
The implementation in [TheilSen.m](TheilSen.m) is exact but naïve:
It generates the set of all pairs of the _n_ input samples, resulting in an overall complexity of _O(n²)_ in speed and space.
The resulting slope and offset are the median slope and offset of the lines defined by all data point pairs.
(Note that alternative implementations of the algorithm have lower complexity, and are thus much faster for large amounts of input samples.)
### No toolbox required
This code is based on [Theil-Sen Robust Linear Regression](https://mathworks.com/matlabcentral/fileexchange/48294-theil-sen-robust-linear-regression), version 1.2.0.0, by Zachary Danziger.
A key modification is to use `median(X, 'omitnan')` instead of `nanmedian(X)` to avoid dependency on the (commercially licensed) [Statistics Toolbox](https://mathworks.com/products/statistics.html).
See the [changelog](#changelog) below for further modifications.
(Note that there are several other implementations on Mathworks's [File Exchange](https://mathworks.com/matlabcentral/fileexchange).
Unfortunately most were found to depend on the Statistics Toolbox, [except one](https://mathworks.com/matlabcentral/fileexchange/43135-regression-utilities), which was judged to be slower and less versatile.)
Unfortunately, most were found to depend on the Statistics Toolbox, [except one](https://mathworks.com/matlabcentral/fileexchange/43135-regression-utilities), which was judged to be slower and less versatile.)
See the [changelog](#changelog) below for further modifications.
## Installation
......@@ -36,9 +35,10 @@ Please refer to the comments in the header lines of [TheilSen.m](TheilSen.m).
### Example
The script [example.m](example.m) simulates data based on known "true" values with minor, additive Gaussian noise.
The script [example.m](example.m) simulates data based on known, "true" values with minor, additive Gaussian noise.
The data are then corrupted with a small percentage of outliers.
It then fits and compares the least squares with the Theil-Sen estimator.
The data are fit with the Theil-Sen estimator and least squares, for comparison.
Note how a few "unlucky" outliers can bias the least squares estimate (LS), but have little effect on the Theil-Sen estimator (TS).
<img src="example.svg" alt="plot from example.m" width=500px />
......
function coef = TheilSen(X, y)
% THEILSEN performs Theil-Sen robust, simple linear regression(s) of X on y.
% THEILSEN performs Theil-Sen robust simple linear regression(s) of X on y.
%
% For convenience, the input X may contain more than one predictor variable.
% But note that two or more predictor variables in X are treated as independent
% simple regressions; do not confuse the output with multiple regression.
% For convenience, the input X may contain more than one predictor variable;
% but note that two or more predictor variables in X are treated as independent
% simple regressions: Do not confuse the output with multiple regression.
%
% THEILSEN treats NaNs in X or y as missing values, and ignores them.
% THEILSEN treats NaNs in X or y as missing values and ignores them.
%
% INPUT
% X: One or more columns vectors containing explanatory/predictor variables.
% Rows contain the observed predictor variables.
% X: One or more column vectors containing explanatory/predictor variables.
% Rows represent observations (assumed to be i.i.d.).
% y: A column vector containing the observations of the response variable.
%
% OUTPUT
% coef: Estimated regression coefficients for each predictor column in X, with
% respect to the response variable y. Each column in coef corresponds
% to one predictor in X, i.e. it has as many columns as X does.
% to one predictor in X, i.e. it will have as many columns as X.
% The first row, i.e. coef(1, :), contains the estimated offset(s).
% The second row, i.e. coef(2, :), contains the estimated slope(s).
% (This output format is chosen to avoid confusion, e.g. with previous
......@@ -39,7 +39,7 @@ function coef = TheilSen(X, y)
sizeX = size(X);
sizeY = size(y);
if length(sizeY) ~= 2 || sizeY(1) < 2 || sizeY(2) ~= 1 || ~isnumeric(X)
if length(sizeY) ~= 2 || sizeY(1) < 2 || sizeY(2) ~= 1 || ~isnumeric(Y)
error('Input y must be a column array of at least 2 observed responses.')
end
......@@ -51,16 +51,13 @@ if sizeX(1) ~= sizeY(1)
error('The number of rows (observations) of X and y must match.')
end
Num_Obs = sizeX(1); % rows in X and y are observations
Num_Obs = sizeX(1); % rows in X (and y) are observations
Num_Pred = sizeX(2); % columns in X are (independent) predictor variables
if Num_Obs < 2
error('Expecting a data matrix Obs x Dim with at least 2 observations.')
end
%%% For the curious, here more readable code for 1 predictor column in X.
%%% However, the special case is absorbed in the general version for
%%% any number of columns in X, for the sake of code simplicity.
%%% For the curious, here more readable code for 1 column in X.
%%% This special case is absorbed in the general version below, for
%%% any number of columns in X, to avoid code duplication.
%
% % calculate slope for all pairs of data points
% C = nan(Num_Obs, Num_Obs);
% for i = 1:Num_Obs-1
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment