Skip to content
Snippets Groups Projects
Commit 225e5d0b authored by Keyser, Johannes's avatar Keyser, Johannes
Browse files

Improved documentation and example code.

parent 2f31897f
No related branches found
No related tags found
No related merge requests found
*.m~
# Theil-Sen-Matlab # Theil-Sen estimator for Matlab
## Description ## Description
...@@ -6,16 +6,18 @@ A fast, stand-alone Theil-Sen estimator for robust regression in Matlab. ...@@ -6,16 +6,18 @@ A fast, stand-alone Theil-Sen estimator for robust regression in Matlab.
### Theil-Sen estimator ### Theil-Sen estimator
A [Theil-Sen estimator](https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator) provides robust linear regression in the 2D plane. A [Theil-Sen estimator](https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator) provides robust linear regression in the 2D plane:
The resulting estimates of slope and intercept are relatively insensitive to outliers. The resulting estimates of slope and intercept are relatively insensitive to outliers.
FIXME: Mention main idea (data point pairs), and resulting complexity.
### No dependency on Matlab toolboxes ### No dependency on Matlab toolboxes
This code is based on [Theil-Sen Robust Linear Regression](https://mathworks.com/matlabcentral/fileexchange/48294-theil-sen-robust-linear-regression), version 1.2.0.0, by [Zachary Danziger](https://mathworks.com/matlabcentral/profile/authors/1044524). This code is based on [Theil-Sen Robust Linear Regression](https://mathworks.com/matlabcentral/fileexchange/48294-theil-sen-robust-linear-regression), version 1.2.0.0, by [Zachary Danziger](https://mathworks.com/matlabcentral/profile/authors/1044524).
It was modified mainly by re-defining its call to the `nanmedian()` function, which creates a dependency on the (commercially licensed) [Statistics Toolbox](https://mathworks.com/products/statistics.html). It was modified mainly by re-defining its call to the `nanmedian()` function, which creates a dependency on the (commercially licensed) [Statistics Toolbox](https://mathworks.com/products/statistics.html).
There are several other implementations of the Theil-Sen estimator on Mathworks's [File Exchange](https://mathworks.com/matlabcentral/fileexchange). There are several other implementations of the Theil-Sen estimator on Mathworks's [File Exchange](https://mathworks.com/matlabcentral/fileexchange).
Unfortunately all were found to depend on the Statistics Toolbox, [except one one](https://mathworks.com/matlabcentral/fileexchange/43135-regression-utilities), but that implementation was judged to be slower and less versatile. Unfortunately all were found to depend on the Statistics Toolbox, [except one](https://mathworks.com/matlabcentral/fileexchange/43135-regression-utilities), which was judged to be slower and less versatile.
## Installation ## Installation
...@@ -31,21 +33,27 @@ TODO: Add examples, and show some expected output. ...@@ -31,21 +33,27 @@ TODO: Add examples, and show some expected output.
- [ ] Mention original use case. - [ ] Mention original use case.
- [ ] Add usage example and plot. - [ ] Add usage example and plot.
- [ ] Add a drop-in for `nanmean()` to become independent of Statistics Toolbox. - [ ] Add a drop-in for `nanmean()` to become independent of Statistics Toolbox.
- [ ] Adjust formatting to my taste. - [ ] Change input interface to match Matlab's `robustfit()`.
- [ ] Adapt license file. - [ ] Add more checks for input sanity.
## Changelog
- October 2014 by Z. Danziger: Original version.
- September 2015 by Z. Danziger: Updated help, speed increase for 2D case
- March 2022 by J. Keyser: Adjusted formatting, added documentation,
## Contributing and project status ## Contributing and project status
This project is relatively unmaintained, and only shared as-is, in the hope to be helpful. This project is relatively unmaintained, and only shared as-is, in the hope to be helpful.
The authors do not accept any liability. If you find a bug, feel free to let the author(s) know.
If you find a bug, feel free to the author(s) know (see below).
Feature requests should be directed to the original author. Feature requests should be directed to the original author.
## Authors ## Authors
1. Zachary Danziger, principal author ([Matlab profile](https://de.mathworks.com/matlabcentral/profile/authors/1044524), [Lab webpage](https://anil.fiu.edu/)) 1. Zachary Danziger, original author ([Matlab profile](https://de.mathworks.com/matlabcentral/profile/authors/1044524), [Lab webpage](https://anil.fiu.edu/))
2. Johannes Keyser [JLU GitLab profile](@jkeyser) 2. Johannes Keyser
## License ## License
Please see [license.txt](license.txt). [Simplified BSD License](https://en.wikipedia.org/wiki/BSD_licenses#2-clause_license_(%22Simplified_BSD_License%22_or_%22FreeBSD_License%22)), see [license.txt](license.txt).
...@@ -3,27 +3,30 @@ function [m, b] = TheilSen(data) ...@@ -3,27 +3,30 @@ function [m, b] = TheilSen(data)
% %
% [m, b] = TheilSen(data) % [m, b] = TheilSen(data)
% %
% INPUT
% data: A MxD matrix with M observations. The first D-1 columns are the % data: A MxD matrix with M observations. The first D-1 columns are the
% explanatory variables and the Dth column is the response such that % explanatory variables and the Dth column is the response such that
% data = [x1, x2, ..., x(D-1), y]; % data = [x1, x2, ..., x(D-1), y];
% %
% OUTPUT
% m: Estimated slope of each explanatory variable with respect to the % m: Estimated slope of each explanatory variable with respect to the
% response varuable. Therefore, m will be a vector of D-1 slopes. % response varuable. Therefore, m will be a vector of D-1 slopes.
% b: Estimated offsets. % b: Estimated offsets.
% %
% EXAMPLE
% See accompanying file example.m.
% %
% Source: % REFERENCE
% Gilbert, Richard O. (1987), "6.5 Sen's Nonparametric Estimator of % - Gilbert, Richard O. (1987), "6.5 Sen's Nonparametric Estimator of Slope",
% Slope", Statistical Methods for Environmental Pollution Monitoring, % Statistical Methods for Environmental Pollution Monitoring,
% John Wiley and Sons, pp. 217-219, ISBN 978-0-471-28878-7 % John Wiley and Sons, pp. 217-219, ISBN 978-0-471-28878-7
% %
% AUTHORS
% 2014-2015 Zachary Danziger
% 2022 Johannes Keyser
% %
% % LICENSE
% %%% Z. Danziger October 2014 %%% % Simplified BSD license, see accompanying file license.txt.
% edits Z. Danziger September 2015:
% - updated help
% - speed increase for 2D case
%
sz = size(data); sz = size(data);
if length(sz) ~= 2 || sz(1) < 2 if length(sz) ~= 2 || sz(1) < 2
......
n = 100; % This example demonstrates the use of TheilSen.m with simulated data.
outN = round(0.2 * n); clearvars *
noise = randn(n, 2) * 0.1;
noise(randi(n * 2, [outN, 1])) = randn(outN, 1) * 5; % Choose total sample size and fraction of outliers:
data = [linspace(0, 10, n)', linspace(0,5,n)'] + noise; N_total = 20;
Bhat = [ones(n, 1), data(:, 1)] \ data(:, 2); N_outlr = round(0.2 * N_total);
[m, b] = TheilSen(data);
plims = [min(data(:, 1)), max(data(:, 1))]'; % Simulate a linear data set
true_b0 = -2;
true_b1 = 10;
data_x = linspace(0, 1, N_total)';
data_y = true_b0 + true_b1 .* data_x;
% add some "usual" Gaussian noise to both dimensions
SDx_usual = 0.1;
SDy_usual = 2.0;
data_x = data_x + randn(size(data_x)) * SDx_usual;
data_y = data_y + randn(size(data_y)) * SDy_usual;
% simulate "outliers" as signal-dependent, positive noise in y dimension
SDy_outlr = 20 * SDy_usual;
outlr_idx = randperm(N_total, N_outlr); % randomly select data points
outlr_y = data_y(outlr_idx); % start with original value
outlr_y = abs(outlr_y) + 3 * SDy_usual; % set to unlikely high value
outlr_y = outlr_y + abs(randn(N_outlr, 1) * SDy_outlr); % over-dispersion
outlr_y = outlr_y .* data_x(outlr_idx); % scale by x magnitude
data_y(outlr_idx) = outlr_y;
% estimate least squares parameters
est_ls = [ones(N_total, 1), data_x] \ data_y;
% estimate Theil-Sen parameters
[m, b] = TheilSen([data_x, data_y]);
est_ts = [b, m];
% plot everything
figure() figure()
plot(data(:, 1), data(:, 2), 'k.', ... plims = [min(data_x), max(data_x)]';
plims, plims * m + b, '-r', ... normal_idx = setdiff(1:N_total, outlr_idx);
plims, plims * Bhat(2) + Bhat(1), '-b', 'linewidth', 2) plot(data_x(normal_idx), data_y(normal_idx), 'ko', ...
legend('Data', 'Theil-Sen', 'Least Squares', 'location', 'NW') data_x(outlr_idx), data_y(outlr_idx), 'rx', ...
title(sprintf('Acual Slope = %2.3g, LS est. = %2.3g, TS est. = %2.3g', ... plims, plims * est_ls(2) + est_ls(1), '-r', ...
[0.5 Bhat(2) m])) plims, plims * m + b, 'b-', ...
'linewidth', 2)
legend('Normal data', 'Outlier', 'Least Squares', 'Theil-Sen', 'location', 'NW')
title(sprintf('True: [%.3g, %.3g], LS: [%.3g, %.3g], TS: [%.3g, %.3g]', ...
[true_b0, true_b1, est_ls(1), est_ls(2), est_ts(1), est_ts(2)]))
\ No newline at end of file
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment