%% Cell type:markdown id:87c9d786-6d5c-4366-9701-5fa127727caa tags:
# Lecture 1
---
## Basic statistics
<br>
Material: [https://gitlab.rrz.uni-hamburg.de/BAN1966/statlecture](https://gitlab.rrz.uni-hamburg.de/BAN1966/statlecture)
<br>
Hartmut Stadie
hartmut.stadie@uni-hamburg.de
%% Cell type:markdown id:a3347273 tags:
## Probability
### Probability
When is a system *random*?
The degree of randomness can be quantified with the concept of *probability*.
Consider the sample space $S$:
Axioms by Kolmogorov:
- for every subset $A$ in $S$,
$P(A) \ge 0$
- for two disjoint subsets $A$ and $B$
($A \cap B = \emptyset$),
$P(A \cup B) = P(A) + P(B)$
- for the whole sample space $S$,
$P(S) = 1$
*random variable*: a variable that takes on a specific value for each element of $S$
%% Cell type:markdown id:a7de85eb tags:
### Conditional probability
%% Cell type:markdown id:67a031e2 tags:
Definition: **Conditional probability**
Conditional probability for two subsets $A$ and $B$ in $S$:
$P(A|B) = \dfrac{P(A \cap B)}{P(B)}$
Definition: **independence**
Two subsets $A$ and $B$ in $S$ are independent if
$P(A \cap B) = P(A) P(B)$
If $A$ and $B$ are independent:
$P(A|B) = P(A)$ and $P(B|A) = P(B)$
%% Cell type:markdown id:eb310ab8 tags:
<img src="./figures/08/Conditional_probability.png" />
CC0, via Wikimedia Commons
%% Cell type:markdown id:8a49042a tags:
### Bayes' theorem
%% Cell type:markdown id:a6fd8465 tags:
$P(A|B) = \dfrac{P(A \cap B)}{P(B)}$
$P(B|A) = \dfrac{P(B \cap A)}{P(A)}$
Theorem: $$P(A|B) = \dfrac{P(B|A) P(A)}{P(B)}$$
%% Cell type:markdown id:72effe84 tags:
<img src="./figures/08/bayes.gif" style="width:80.0%" />
most likely not Bayes
%% Cell type:markdown id:deb1f10f tags:
### Classic example
Test for disease: Prior: $P(\text{disease}) = 0.001$;
$P(\text{no disease}) = 0.999$
Test: $P(+|\text{disease}) = 0.98$;
$P(+|\text{no disease}) = 0.03$
What is the implication of a positive test result?
$$\begin{aligned}
P(\text{disease}|+) &= & \dfrac{P(+|\text{disease}) P(\text{disease})}{P(+)} \\
& = &\dfrac{P(+|\text{disease}) P(\text{disease})}{P(+|\text{disease})P(\text{disease})+P(+|\text{no disease})P(\text{no disease})}\\
& = &\dfrac{0.98\cdot 0.001}{0.98\cdot 0.001 + 0.03\cdot 0.999}\\
& = &0.032\\
\end{aligned}$$
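A quick numerical check of this calculation (a minimal sketch using only the numbers given above):

``` python
# Numerical check of the disease-test example with plain Python
p_d = 0.001            # prior P(disease)
p_pos_d = 0.98         # P(+|disease)
p_pos_nd = 0.03        # P(+|no disease)

p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # law of total probability
print("P(disease|+):", p_pos_d * p_d / p_pos)  # approximately 0.032
```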
%% Cell type:markdown id:c9c11bcb tags:
## Interpretation of probability
### Interpretations of probability
%% Cell type:markdown id:5104a1d9 tags:
**Objective interpretation**:
Probability as a relative frequency:
$$P(A) = \lim_{n\to\infty} \dfrac{\text{number of occurrences of outcome $A$ in $n$ measurements}}{n}$$
(*frequentist*)
%% Cell type:markdown id:389d7d78 tags:
**Subjective interpretation**:
$$P(A) = \text{degree of belief that hypothesis $A$ is true}$$
Typical example:
$$P(\text{theory}|\text{data}) \propto P(\text{data}|\text{theory}) P(\text{theory})$$
(*Bayesian*)
%% Cell type:markdown id:b8ab42a5-b7af-4942-9a7a-91b73e0ce625 tags:
# Probability Density Functions
Single continuous variable $x$ which describes the outcome of an experiment:
probability density function (p.d.f.) $f(x)$:
- probability to observe $x$ in the interval $[x, x + dx]$:
$f(x)\,dx$
- normalization $$\int_S f(x)\,dx = 1$$
Cumulative distribution (cumulative distribution function (cdf)) $F(x)$:<br>
(mathematics: distribution function)
- probability to observe values less than or equal to $x$:
$$F(x) = \int_{-\infty}^x f(x^\prime)\,dx^\prime$$
%% Cell type:markdown id:13563524-0626-4a69-a6ab-f87d7f19a016 tags:
### Example
$$P(a \le x \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$
%% Cell type:markdown id:99c06502 tags:
# Probability Density Functions
%% Cell type:markdown id:165ca1d8 tags:
<img src="./figures/08/Lognormal_distribution_PDF.svg" width="150%" />
%% Cell type:markdown id:7b97cfc3 tags:
<img src="./figures/08/CDF-log_normal_distributions.svg" width="75%" />
%% Cell type:markdown id:8eb9fa5b tags:
### Quantiles
%% Cell type:markdown id:8e7b412c tags:
Quantile $x_\alpha$ is the value of the random variable $x$ with
$$F(x_\alpha) = \int_{-\infty}^{x_\alpha} f(x)\,dx = \alpha$$
Hence: $$x_\alpha = F^{-1}(\alpha)$$
Median: $x_{\frac{1}{2}}$
$$F(x_{\frac{1}{2}}) = 0.5$$ $$x_{\frac{1}{2}} = F^{-1}(0.5)$$
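In scipy the inverse of the cdf, $F^{-1}$, is available as `ppf` (percent point function); a minimal sketch, with the standard normal distribution as an example choice:

``` python
# Quantiles via the inverse cdf; scipy calls F^{-1} "ppf"
import scipy.stats

for alpha in [0.16, 0.5, 0.84]:
    print(alpha, scipy.stats.norm.ppf(alpha))
# alpha = 0.5 gives the median (0 for the standard normal)
```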
%% Cell type:markdown id:14e4b782-9573-4e87-89d0-901dade25538 tags:
<img src="./figures/09/Normalverteilung.png" width=80% alt="image" />
%% Cell type:markdown id:1258f494 tags:
### Joint probability density function
%% Cell type:markdown id:783ab1f3 tags:
Example: outcome of a measurement characterized by two continuous variables $x,y$. <br>
Event $A$: $x$ observed in $[x, x + dx]$, $y$ anywhere<br>
Event $B$: $y$ observed in $[y, y + dy]$, $x$ anywhere
$$P(A \cap B) = \text{probability for $x$ in $[x, x + dx]$ and $y$ in $[y, y + dy]$} = f(x, y)\,dxdy$$
Marginal p.d.f.: $$f_x(x) = \int_{-\infty}^\infty f(x,y)\,dy$$
$$f_y(y) = \int_{-\infty}^\infty f(x,y)\,dx$$
%% Cell type:markdown id:8e74c511-b730-4781-a751-e1895983cc4c tags:
<img src="./figures/09/W_top.png" width="150%" alt="image" />
%% Cell type:markdown id:8f2cf729 tags:
### Marginal pdfs
%% Cell type:markdown id:438662b0 tags:
<img src="./figures/09/W.png" style="width:47.0%"
alt="image" />
%% Cell type:markdown id:55461265 tags:
<img src="./figures/09/top.png" style="width:47.0%"
alt="image" />
%% Cell type:markdown id:c88542a8 tags:
### Conditional p.d.f.
%% Cell type:markdown id:ec5c20d4 tags:
Conditional p.d.f.: $$g(x|y) = \frac{f(x,y)}{f_y(y)}$$
$$h(y|x) = \frac{f(x,y)}{f_x(x)}$$
%% Cell type:markdown id:7624d391-9fd5-4567-bf84-632eead1bb20 tags:
<img src="./figures/09/top_cond.png" width="150%"
alt="image" />
%% Cell type:markdown id:2e0275c5 tags:
### Bayes' theorem
%% Cell type:markdown id:76ce2f83 tags:
$g(x|y) = \frac{f(x,y)}{f_y(y)}$ and
$h(y|x) = \frac{f(x,y)}{f_x(x)}$
Theorem: $$g(x|y) = \frac{h(y|x) f_x(x)}{f_y(y)}$$
With $f(x,y) = h(y|x) f_x(x) = g(x|y) f_y(y)$:
$$f_x(x) = \int_{-\infty}^\infty g(x|y) f_y(y)\,dy$$
$$f_y(y) = \int_{-\infty}^\infty h(y|x) f_x(x)\,dx$$
%% Cell type:markdown id:ba8fae9b-636b-4b7e-a36d-3be582c000f5 tags:
<img src="./figures/08/bayes.gif" style="width:80.0%" />
most likely not Bayes
%% Cell type:markdown id:fdd11743-1e7b-48b6-9e9b-41bee7d4f23b tags:
## Functions of random variables
### Functions of random variables
Let $x$ be a random variable, $f(x)$ its p.d.f. and
$a(x)$ a continuous function:
What is the p.d.f. $g(a)$?
Equal probability for $x$ in $[x, x+dx]$ and $a$ in $[a, a+da]$:
$$g(a)\,da = \int_{dS} f(x)\,dx$$
If $x(a)$ (the inverse of $a(x)$) exists:
$$g(a) da = \left| \int_{x(a)}^{x(a +da)} f(x^\prime)\,dx^\prime \right| = \int_{x(a)}^{x(a) + |\frac{dx}{da}|da} f(x^\prime)\,dx^\prime$$
or
$$g(a) = f(x(a)) \left|\frac{dx}{da}\right|$$
%% Cell type:markdown id:525847ad-2ddf-468c-883f-35bc853bac8c tags:
### Examples
- example 1:
For $x$ equally distributed between 0 and 1, p.d.f. of $x$: $u(x) = 1$ and $a(x) = \sqrt{x}$, $x(a) = a^2$<br>
p.d.f. $g(a)$:
$$g(a) = u(x(a)) \left|\frac{dx}{da}\right| = 1 \cdot \left|\frac{da^2}{da}\right| = 2a \text{ (linearly distributed)}$$
<br>
- example 2:
For $x$ equally distributed between 0 and 1, p.d.f. of $x$: $u(x) = 1$ and $a(x) = F^{-1}(x)$, $x(a) = F(a)$<br>
p.d.f. $g(a)$:
$$g(a) = u(x(a)) \left|\frac{dx}{da}\right| = 1 \cdot \left|\frac{dF(a)}{da}\right| = f(a)$$
%% Cell type:markdown id:0c669bab-6c81-45fb-a506-8ef709cd4687 tags:
### Functions of vectors of random variables
Let $\vec x$ be vector of random variables, $f(\vec x)$ the p.d.f. and $\vec a(\vec x)$ a continuous function:
What is the p.d.f. $g(\vec a)$?
$$g(\vec a) = f(\vec x(\vec a)) \left| J \right| \text{, where $\left| J \right|$ is the absolute value of the Jacobian determinant of } J =
\left(\begin{array}{cccc}
\frac{\partial x_1}{\partial a_1} & \frac{\partial x_1}{\partial a_2} & \dots & \frac{\partial x_1}{\partial a_m} \\[6pt]
\frac{\partial x_2}{\partial a_1} & \frac{\partial x_2}{\partial a_2} & \dots & \frac{\partial x_2}{\partial a_m} \\[6pt]
\vdots & \vdots & \ddots & \vdots \\[6pt]
\frac{\partial x_n}{\partial a_1} & \frac{\partial x_n}{\partial a_2} & \dots & \frac{\partial x_n}{\partial a_m} \\[6pt]
\end{array}\right)$$
%% Cell type:markdown id:2907584f-bb17-463e-b084-749f2011bd4c tags:
### Expectation value and moments
- **Definition:**
expectation value of the function $h(x)$ for a p.d.f. $f(x)$:
$$E[h] = \int_{-\infty}^{\infty} h(x) \, f(x)\,dx$$
- **special case:** $h(x) = x$:
$$E[x] = \int_{-\infty}^{\infty} x \, f(x)\,dx = <x>$$
$E[x]$ is called the population mean or just mean, $\bar x$ or $\mu$.
- Expectation value is a linear operator:
$$E[a\cdot g(x) + b \cdot h(x)] = a\cdot E[g(x)] + b\cdot E[h(x)]$$
- $n$th moment:
$$E[x^n] = \int_{-\infty}^{\infty} x^n \, f(x)\,dx$$
- $n$th central moment:
$$E[(x - E[x])^n] = E[(x-\mu)^n] = \int_{-\infty}^{\infty} (x-\mu)^n \, f(x)\,dx$$
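A numerical check of these definitions (a minimal sketch; the standard normal p.d.f. is just an example choice):

``` python
# Numerical check of the first moment and second central moment
# for a standard normal p.d.f.
import numpy as np
import scipy.integrate
import scipy.stats

f = scipy.stats.norm.pdf
mu = scipy.integrate.quad(lambda x: x * f(x), -np.inf, np.inf)[0]
var = scipy.integrate.quad(lambda x: (x - mu)**2 * f(x), -np.inf, np.inf)[0]
print(mu, var)   # close to 0 and 1
```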
%% Cell type:markdown id:4b7f6667-c019-4b23-83fd-a1758d748762 tags:
## Basic concepts
### Basic concepts
Discrete random variable, mean:
$$<r> = \bar r = \sum _{i=1}^N r_i P(r_i)$$
Continuous random variable, probability density $f(x)$ with
- $P(a \leq x \leq b) = \int_a^b f(x)\,dx$
- $f(x) \geq 0$
- $\int_{-\infty}^{\infty} f(x)\,dx = 1$
Mean:
$$<x> = \bar x = \int_{-\infty}^{\infty} x \, f(x)\,dx = \mu_x$$
%% Cell type:markdown id:5cd273e7-3129-454b-a31c-4e91c4316bf4 tags:
### Variance and standard deviation
variance $V[x]$:
- measure for the width of a p.d.f.
- second central moment
- definition:
$$V[x] = E[(x - \mu_x)^2] = \int_{-\infty}^{\infty} (x-\mu_x)^2 \, f(x)\,dx$$
- useful relations:
$$V = E[(x - \mu)^2] = E[x^2 - 2x\mu + \mu^2] = E[x^2] - 2\mu E[x] + \mu^2 = E[x^2] - 2 \mu^2 + \mu^2 = E[x^2] - (E[x])^2$$
$$V[ax] = a^2 V[x]$$
%% Cell type:markdown id:30d01c8b-50a5-4d90-8e0b-89a68f52f9bd tags:
### Variance and standard deviation
standard deviation $\sigma$:
- measure for the variation of a random variable around its mean
<br>
- in physics: “the error”
<br>
- definition $$\sigma = \sqrt{V[x]}$$<br>
%% Cell type:markdown id:ae67472e-f9b2-46fa-9679-15553d31aaaf tags:
### Covariance
- covariance $V_{xy}$ for two random variables $x$ and $y$ with p.d.f. $f(x,y)$:
$$V_{xy} = E[(x - \mu_x)(y - \mu_y)] = E[xy] - \mu_x \mu_y$$
$$V_{xy} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\, f(x, y)\,dx \,dy - \mu_x\mu_y$$
- Covariance $V_{ab} = \text{cov}[a, b]$ of two functions $a$ and $b$ of the random vector $\vec x$:
$$\text{cov}[a, b] = E[(a - \mu_a)(b - \mu_b)] = E[ab] - \mu_a \mu_b$$
$$\text{cov}[a, b] = \int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} a(\vec x) b(\vec x)\, f(\vec x)\,dx_1 \dots \,dx_n - \mu_a\mu_b$$
%% Cell type:markdown id:e9c61f18 tags:
### Covariance
- covariance: $$\text{cov}(x,y) = E[(x-E[x])(y-E[y])] = E[xy] - E[x]E[y]$$
for samples $X = x_1, x_2,\dots, x_N$ and $Y = y_1, y_2,\dots, y_N$:
$$\text{cov}(X,Y) = \frac{1}{N}\sum\limits_{i=1}^N (x_i - \overline x)(y_i - \overline y)$$
- correlation $$\rho_{xy} = \frac{\text{cov}(X,Y)}{\sqrt{\text{cov}(X,X)\text{cov}(Y,Y)}}$$
%% Cell type:markdown id:576b4570-8a06-4224-b867-31459483e6bb tags:
### Covariance matrix
$$C = \left(
\begin{array}{rr}
V_{xx} & V_{xy} \\
V_{yx} & V_{yy}\\
\end{array}
\right)$$
Remarks:
- sometimes called error matrix
<br>
- $V_{xy} = V_{yx}$, matrix is symmetric
<br>
- $V_{ii} > 0$, matrix is positive (semi)definite
<br>
- correlation matrix: $$C^\prime = \left(
\begin{array}{rr}
V_{xx}/V_{xx} & V_{xy}/\sqrt{V_{xx}V_{yy}} \\
V_{xy}/\sqrt{V_{xx}V_{yy}} & V_{yy}/V_{yy}\\
\end{array}
\right) = \left(
\begin{array}{rr}
1 & \rho_{xy} \\
\rho_{xy} & 1\\
\end{array}
\right)$$
- correlation coefficient:
$$\rho_{xy} = \frac{V_{xy}}{\sqrt{V_{xx}V_{yy}}}$$
%% Cell type:markdown id:c3b63f0e tags:
### Error propagation
Suppose we have a random vector $\vec x$ distributed according to joint p.d.f. $f(\vec x)$ with mean values $\vec \mu$ and covariance matrix $V$:
What is the variance of the function $y(\vec x)$?
Expand $y$ around $\vec x = \vec \mu$:
$$y(\vec x) \approx y(\vec \mu) + \sum_{i = 1}^{N} \frac{\partial y}{\partial x_i}\big|_{\vec \mu}(x_i-\mu_i)$$
Expectation value of $y$ (the linear term vanishes because $E[x_i-\mu_i] = 0$):
$$E[y] \approx y(\vec \mu)$$
Expectation value of $y^2$:
$$E[y^2] \approx E[(y(\vec \mu) + \sum_{i = 1}^{N} \frac{\partial y}{\partial x_i}\big|_{\vec \mu}(x_i-\mu_i))(y(\vec \mu) + \sum_{j = 1}^{N} \frac{\partial y}{\partial x_j}\big|_{\vec \mu}(x_j-\mu_j))] = y^2(\vec \mu) + \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{\partial y}{\partial x_i}\big|_{\vec \mu} \frac{\partial y}{\partial x_j}\big|_{\vec \mu}E[(x_i-\mu_i)(x_j- \mu_j)]$$
$$E[y^2] = y^2(\vec \mu) + \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{\partial y}{\partial x_i}\big|_{\vec \mu} \frac{\partial y}{\partial x_j}\big|_{\vec \mu} V_{ij}$$
variance of $y$:
$$\sigma^2_y = E[y^2] - E[y]^2 \approx \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{\partial y}{\partial x_i}\big|_{\vec \mu} \frac{\partial y}{\partial x_j}\big|_{\vec \mu} V_{ij} $$
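A quick Monte Carlo check of this linearized formula (a minimal sketch; the function $y = x_1 x_2$ and the covariance matrix are example choices):

``` python
# Monte Carlo check of the linearized variance formula for y = x1*x2
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, 3.0])
V = np.array([[0.04, 0.01],
              [0.01, 0.09]])
x = rng.multivariate_normal(mu, V, size=100000)
y = x[:, 0] * x[:, 1]

grad = np.array([mu[1], mu[0]])   # (dy/dx1, dy/dx2) evaluated at the mean
print("linearized:", grad @ V @ grad)
print("Monte Carlo:", np.var(y))  # agrees up to higher-order terms
```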
%% Cell type:markdown id:38fedec1 tags:
### Error propagation in more dimensions
Now assume a vector function $\vec y(\vec x) = (y_1(\vec x),\dots,y_M(\vec x))$:
Covariance $U_{kl}$ for $y_k$ and $y_l$:
$$U_{kl} = \text{cov}[y_k, y_l] = \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{\partial y_k}{\partial x_i}\big|_{\vec \mu} \frac{\partial y_l}{\partial x_j}\big|_{\vec \mu} V_{ij}$$
With the matrix of derivatives $A$, $A_{ij} = \frac{\partial y_i}{\partial x_j}\big|_{\vec \mu}$:
$$ U = A V A^{T}$$
Example: $y = x_1 + x_2$ and, hence, $A = (1, 1)$
$$U = \left(\begin{array}{rr}1 & 1\\ \end{array}\right)
\left(
\begin{array}{rr}\sigma_1^2 & V_{12} \\ V_{12} & \sigma_2^2\\ \end{array}
\right)
\left(\begin{array}{r}1 \\ 1\\ \end{array}\right) =
\left(\begin{array}{rr}\sigma_1^2 + V_{12} & V_{12}+ \sigma_2^2\\ \end{array}
\right) \left(\begin{array}{r}1 \\ 1\\ \end{array}\right) = \sigma_1^2 + \sigma_2^2 + 2V_{12}$$
Example: $y = x_1 x_2$ and, hence, $A = (x_2, x_1)$
$$\frac{\sigma^2_y}{y^2} = \frac{\sigma^2_1}{x_1^2} + \frac{\sigma^2_2}{x_2^2} + 2 \frac{V_{12}}{x_1 x_2}$$
%% Cell type:markdown id:9f09c066 tags:
Now let's try a few things!!!
<br>
Any questions so far?
%% Cell type:markdown id:9eae9199-9401-40c4-b212-ae57f1ccab38 tags:
## Samples
---
Sample: $X = x_1, x_2,\dots, x_N$
Expectation value:
$$E[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x_i)$$
Describing samples: minimum, maximum, frequency/histogram, means, variance, standard deviation,....
%% Cell type:markdown id:a0bccc47 tags:
### Describing samples
minimum, maximum, frequency/histogram, means, variance, standard deviation,....
Here: home and away goals in Bundesliga matches
%% Cell type:code id:b44e356e-b829-4879-b5fc-9706fffe873d tags:
``` python
import numpy as np
data = np.loadtxt('./exercises/09_data.txt')
data[0:9]
```
%% Cell type:code id:2aeacb94-518d-464f-a7ab-5282b97bc225 tags:
``` python
data[0:9,0]
```
%% Cell type:code id:1829fa78-7d7b-4bac-9f24-5129dca63629 tags:
``` python
np.min(data), np.max(data)
```
%% Cell type:markdown id:caa12e94-0b72-4f60-875d-5a306a09d036 tags:
### Histograms
%% Cell type:code id:871f915d-09f8-4d14-a79a-8fbd2f16ab75 tags:
``` python
import matplotlib.pyplot as plt
plt.hist(data[:, 0])
#plt.hist(data[:, 0], bins=np.arange(-0.25,6.25,0.5))
#plt.xlabel("k")
```
%% Cell type:markdown id:e9327372 tags:
### Histograms
%% Cell type:code id:f8263279-48b5-4421-be05-604ddbfd8d6f tags:
``` python
plt.hist(data[:, 0], bins=np.arange(-0.25,6.25,0.5))
plt.xlabel("k")
```
%% Cell type:markdown id:61619361 tags:
### Histograms
%% Cell type:code id:5056a717-6561-41ec-be39-1757984863a9 tags:
``` python
plt.hist(data[:, 0], bins=np.arange(-0.25,6.26,0.5))
plt.xlabel("k")
#plt.savefig("hist.pdf")
plt.show()
```
%% Cell type:code id:3f64d35c tags:
``` python
plt.hist(data[:, 1], bins=np.arange(-0.25,6.26,0.5))
plt.xlabel("l")
plt.show()
```
%% Cell type:markdown id:4183331e-8a9f-4af5-829f-3fcb9b2abb31 tags:
### Cumulative distribution
%% Cell type:code id:4ecfb198-c621-4ee4-9d24-9f9a8cb8bec7 tags:
``` python
plt.hist(data[:, 0], bins=100, cumulative=True, density=True, label="cumulative")
plt.xlabel("k")
#plt.savefig("hist2.pdf")
print("median", np.median(data[:, 0]))
```
%% Cell type:code id:d00cadf3-0cfa-4f59-9962-4b9e6b707e0b tags:
``` python
```
%% Cell type:markdown id:ceb5ca27-a96a-4ee3-967b-5f02efe66540 tags:
### Means
---
different means:
- arithmetic mean: $$ \overline{x} = E[x] = <x> = \frac{1}{n}\sum\limits_{i=1}^n x_i (= \mu)$$
- geometric mean: $$ \overline{{x}}_\mathrm {geom} = \sqrt[n]{\prod\limits_{i=1}^{n}{x_i}}$$
- quadratic mean: $$ \overline{{x}}_\mathrm{quadr} = \sqrt{E[x^2]} = \sqrt {\frac {1}{n} \sum\limits_{i=1}^{n}x_i^2} = \sqrt{\overline{x^2}} $$
%% Cell type:markdown id:2503d92a tags:
### Variance
- variance
$$V = E[(x - \mu)^2]$$
with $\mu = E[x] = \overline x$:
$$V = E[(x - \overline x)^2] = E[x^2 - 2x{\overline x} + {\overline x}^2] = E[x^2] - 2{\overline x}E[x] + {\overline x}^2 = E[x^2] - 2 {\overline x}^2 + {\overline x}^2 = E[x^2] - (E[x])^2$$
for a sample $X = x_1, x_2,\dots, x_N$:
$$V= \frac{1}{N}\sum\limits_{i=1}^N (x_i - \overline x)^2$$
- standard deviation:
$$\sigma = \sqrt{V}$$
%% Cell type:markdown id:68cea392-8d4f-48d7-aacc-0f10d46af3af tags:
### Exercise: Compute mean and variance of $X$
%% Cell type:code id:5f98b25d tags:
``` python
```
%% Cell type:code id:76d92d18-1d77-40d2-a910-592183635d3b tags:
``` python
print("mean", np.mean(data, axis=0))
```
%% Cell type:code id:5f0f34b4-5bbe-439b-8b4e-fbb05794b790 tags:
``` python
print("variance", np.var(data, axis=0))
```
%% Cell type:code id:70a1920f-beda-4154-ad77-22a9ffe2e39f tags:
``` python
print("standard deviation:", np.std(data, axis=0))
```
%% Cell type:markdown id:20d86d18 tags:
### Exercise: compute covariance and correlation of columns 1 and 2
%% Cell type:code id:bf609514 tags:
``` python
plt.hist2d(data[:,0], data[:,1], bins=np.arange(-0.5,6.1,1));
```
%% Cell type:code id:6fff5415 tags:
``` python
```
%% Cell type:code id:751f9384 tags:
``` python
print(np.cov(data, rowvar=False))
```
%% Cell type:code id:abda1c9f tags:
``` python
print(np.corrcoef(data, rowvar=False))
```
%% Cell type:markdown id:eed39982 tags:
### Exercise: Compute variance of goals per match
Compute the variance of the sum of the home and away goals per match in three ways, where $V$ is the covariance matrix on home,away goals from before:
- wrong error propagation $U = \sigma_1^2 + \sigma_2^2 = V_{11} + V_{22}$
<br>
- correct error propagation $U = \left(\begin{array}{rr}1 & 1\\ \end{array}\right)V\left(\begin{array}{r}1 \\ 1\\ \end{array}\right)$
> You can use `A = np.array([[1, 1]])` to define the matrix of derivatives and <br> `U = A@V@A.T` for the matrix transformation
<br>
- directly with `np.var`
<br>
%% Cell type:code id:6b0966b7 tags:
``` python
```
%% Cell type:markdown id:404bffbd tags:
What changes when you look at the goal difference?
%% Cell type:code id:f88ff1a1 tags:
``` python
A = np.array([[1, 1]])
V = np.cov(data, rowvar=False)
print(V[0,0] + V[1,1])
U = A@V@A.T
print(U)
print(np.var(data[:,0] + data[:,1]))
print(np.var(data[:,0] - data[:,1]))
```
%% Cell type:markdown id:f826b603 tags:
### Exercise: Check "functions of random variables"
%% Cell type:markdown id:ac750c83 tags:
Let's use pseudo-experiments/Monte Carlo:
* generate 100,000 uniformly distributed values $u$
* make a histogram of $u$ and of $\sqrt{u}$
%% Cell type:markdown id:0d93b2f0 tags:
Relatively easy with *scipy* and *numpy*:
* use [numpy random generator](https://numpy.org/doc/stable/reference/random/generator.html)
<br>
or
<br>
* use [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/stats.html)
* use the [`scipy.stats.uniform`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html) class
%% Cell type:code id:b424b5b0 tags:
``` python
import numpy as np
rng = np.random.default_rng(12345)
rfloat = rng.random()
print(rfloat)
```
%% Cell type:code id:785734fb tags:
``` python
u = rng.random(100000)
print(u)
plt.hist(u, bins=100, histtype='step')
plt.hist(np.sqrt(u), bins=100, histtype='step');
```
%% Cell type:code id:4823f38f tags:
``` python
```
%% Cell type:markdown id:74976eec-e9a9-46b1-824d-01dad15478bc tags:
# Lecture 2
---
## Probability density functions and confidence intervals
<br>
<br>
Material: [https://gitlab.rrz.uni-hamburg.de/BAN1966/statlecture](https://gitlab.rrz.uni-hamburg.de/BAN1966/statlecture)
<br>
Hartmut Stadie
hartmut.stadie@uni-hamburg.de
%% Cell type:markdown id:2552c4a9 tags:
# Examples of probability density functions
%% Cell type:markdown id:0383c396 tags:
## Discrete distributions
### Binomial distribution
Consider a series of independent trials, each having two possible outcomes, where $p$ is the probability of success.<br>
The probability for $k$ successes in $n$ trials is given by the **binomial distribution**:
$$P(k) = {n \choose k} p^k(1-p)^{n-k} \text{, } k = 0,1,2,\dots,n$$
Expectation value and variance
$$<k> = E[k] = \sum \limits_{k = 0}^{n} k P(k) = np$$
$$V[k] = \sigma^2 = np(1-p)$$
%% Cell type:markdown id:9350625b tags:
### Example
%% Cell type:markdown id:2039f885 tags:
Tossing five coins $n = 5$, $p = 0.5$
| k | 0 | 1 | 2 | 3 | 4 | 5 |
|:-----|:----:|:----:|:-----:|:-----:|:----:|:----:|
| P(k) | 1/32 | 5/32 | 10/32 | 10/32 | 5/32 | 1/32 |
<br>
plot using [`scipy.stats.binom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html)
%% Cell type:code id:662bb993 tags:
``` python
import numpy as np
import scipy
import matplotlib.pyplot as plt
ks = np.arange(0,5.1)
plt.plot(ks, scipy.stats.binom.pmf(ks, 5, 0.5),'+')
plt.xlabel("$k$")
plt.ylabel("$P(k)$")
```
%% Cell type:markdown id:7ec58fe0 tags:
### Exercise II
Statistical uncertainty on efficiency of a device/selection criterion.
Estimate the efficiency and its uncertainty from a sample with $n$ events where $k$ events are selected.
Random variable is the measured efficiency: $h_k = \frac{k}{n}$.
<br>
How large is the uncertainty?
<br>
Assumption (maybe wrong): $k$ is binomially distributed with $p_k = E[h_k] = E[\frac{k}{n}]$:
$$\begin{aligned}
\sigma(h_k) &= &\sqrt{V[\frac{k}{n}]} = \sqrt{\frac{1}{n^2} V[k]} = \sqrt{\frac{1}{n^2}\cdot np_k(1-p_k)}\\
&=& \sqrt{\frac{p_k(1-p_k)}{n}}\\
\end{aligned}$$
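A minimal sketch of this formula; the counts `n` and `k` below are hypothetical example numbers, not from the lecture data:

``` python
# Efficiency and its binomial uncertainty for assumed example counts
import numpy as np

n, k = 200, 160            # hypothetical numbers
eff = k / n
sigma_eff = np.sqrt(eff * (1 - eff) / n)
print(f"efficiency: {eff:.3f} +/- {sigma_eff:.3f}")
```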
%% Cell type:markdown id:1180ce3f tags:
### Poisson distribution
Consider the binomial case for very large $n$ and very small $p$, such that $np$ remains equal to some finite value $\mu$:
$$P(k) = \frac{\mu^ke^{-\mu}}{k!}$$
Expectation value and variance:
$$E[k] = \sum \limits_{k = 1}^{\infty} k \frac{e^{-\mu}\mu^k}{k!}
= \mu \sum \limits_{k = 1}^{\infty} \frac{e^{-\mu}\mu^{k-1}}{(k-1)!}
= \mu \sum \limits_{s = 0}^{\infty} \frac{e^{-\mu}\mu^{s}}{s!} = \mu$$
$$V[k] = \sigma^2 = \mu$$
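The limit can be checked numerically; a minimal sketch using the values of the comparison plot on the next slide ($n = 1000$, $p = 0.01$, $\mu = np = 10$):

``` python
# Poisson as the limit of the binomial for large n, small p with mu = n*p fixed
import numpy as np
import scipy.stats

mu, n = 10, 1000
k = np.arange(30)
diff = scipy.stats.binom.pmf(k, n, mu / n) - scipy.stats.poisson.pmf(k, mu)
print("largest difference:", np.max(np.abs(diff)))
```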
%% Cell type:markdown id:06c5b8a1 tags:
### Poisson and binomial distributions
binomial distribution with $n= 1000$ and $p = 0.01$<br>
Poisson distribution with $\mu = 10$ (hatched)
<img src="./figures/08/bp.jpg" width="55.0%"
alt="image" />
%% Cell type:markdown id:0d43ae1c tags:
### Example from old statistics books
%% Cell type:markdown id:dd995e3f tags:
Deaths from horse kicks in the Prussian army
The deaths from horse kicks were recorded for each year and army corps (?).
<br>
For 20 years (1875-1894) and 14 army corps, one gets:
| number of deaths $k$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|:------------------------------------------|----:|----:|----:|----:|----:|----:|----:|
| number of corps-years with $k$ deaths | 144 | 91 | 32 | 11 | 2 | 0 | 0 |
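A minimal sketch comparing these counts with the Poisson expectation (numbers from the table above):

``` python
# Compare the horse-kick counts with the Poisson expectation
import numpy as np
import scipy.stats

counts = np.array([144, 91, 32, 11, 2, 0, 0])   # corps-years with k deaths
k = np.arange(len(counts))
mu = np.sum(k * counts) / counts.sum()          # 196/280 = 0.70
expected = counts.sum() * scipy.stats.poisson.pmf(k, mu)
print("mu:", mu)
print("expected:", np.round(expected, 1))
```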
%% Cell type:markdown id:fce70dd5 tags:
<img src="./figures/08/poisson70.png" width="80.0%" />
Poisson distribution for $\mu = \frac{196}{280} = 0.70$ and data with (wrong) error bars
%% Cell type:markdown id:b3097b69-91ec-413e-935e-e310c96d5c25 tags:
## Continuous distributions
### Normal distribution
Normal or Gaussian distribution
$$f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Expectation value and variance:
$$<x> = E[x] = \mu$$
$$V[x] = \sigma^2$$
%% Cell type:markdown id:32da631f tags:
### Standard Gaussian distribution
normal distribution with $\mu = 0$ and $\sigma = 1$
<img src="./figures/10/gauss_alpha.png" width="60.0%"
alt="image" />
%% Cell type:markdown id:f7050597-71f8-45a4-a4ac-64567d17827a tags:
Probability for some intervals:
| | | |
|:----------------------|---------------------------------:|------------:|
| $|x-\mu| \ge \sigma$ | ($x$ outside $\pm 1\sigma$): | $31.74$ % |
| $|x-\mu| \ge 2\sigma$ | ($x$ outside $\pm 2\sigma$): | $4.55$ % |
| $|x-\mu| \ge 3\sigma$ | ($x$ outside $\pm 3\sigma$): | $0.27$ % |
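These values can be checked with `scipy.stats.norm.sf`; a minimal sketch:

``` python
# Probability outside +/- z sigma for a Gaussian
import scipy.stats

for z in [1, 2, 3]:
    p_out = 2 * scipy.stats.norm.sf(z)   # sf(z) = 1 - cdf(z)
    print(f"{z} sigma: {100 * p_out:.2f} %")
```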
%% Cell type:markdown id:b5910862-3a33-46be-91e0-868a3093cd48 tags:
### Multivariate normal distribution
$$f_{X}(\vec x )= \frac {1}{\sqrt {(2\pi )^{p}\det({C)}}}\exp \left(-{\frac {1}{2}}(\vec x- \vec\mu)^{\top}C^{-1}(\vec x - \vec \mu)\right)$$
<img src="./figures/10/Multivariate_Gaussian.png"
width="60.0%" alt="image" />
%% Cell type:markdown id:39087e01 tags:
### 2-D multivariate normal distribution
$$\vec\mu = (\bar x, \bar y) \text{ and } C = \left(
\begin{array}{rr}
\sigma_x ^2 & \rho \sigma_x \sigma_y \\
\rho \sigma_x \sigma_y & \sigma_y^2\\
\end{array} \right)$$
<img src="./figures/10/error_elipse.png" width="70%" alt="image" />
%% Cell type:markdown id:bf784648-04e4-4d6b-a6d0-e85f4210ee1d tags:
### Poisson, binomial, and Gaussian distributions
binomial distribution with $n= 1000$ and $p = 0.01$<br>
Poisson distribution with $\mu = 10$ (hatched)<br>
Gaussian distribution with $\mu = 10$, $\sigma = \sqrt{10}$
<img src="./figures/08/bpg.jpg" width="55.0%"
alt="image" />
%% Cell type:markdown id:91e4ed52 tags:
### Uniform distribution
Uniform distribution: the probability density function $f(x)$ is constant and equal to
$\frac{1}{b-a}$ for $a \leq x \leq b$, and zero outside the interval.
%% Cell type:markdown id:a3b105a2-4332-46f7-a423-df435d4b06e8 tags:
Expectation value and variance:
$$<x> = E[x] = \int_a^b \frac{x}{b-a}\, dx = \frac{1}{2(b-a)} [b^2-a^2] = \frac{a + b}{2}$$
$$\begin{aligned}
V[x] & = &\sigma^2 = E[(x-<x>)^2] = E[x^2] - <x>^2 \\
& = & \int_a^b \frac{x^2}{b-a}\, dx - <x>^2 = \frac{b^3 - a^3}{3(b-a)} - \frac{(a+b)^2}{4}\\
& = & \frac{b^2+ab+a^2}{3}-\frac{a^2 + 2ab + b^2}{4} = \frac{(b-a)^2}{12} \\
\end{aligned}$$
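A quick Monte Carlo check (a minimal sketch; the interval $[a, b] = [2, 5]$ is an arbitrary example choice):

``` python
# Monte Carlo check of mean and variance of a uniform distribution on [a, b]
import numpy as np

a, b = 2.0, 5.0
rng = np.random.default_rng(0)
x = rng.uniform(a, b, 1000000)
print(np.mean(x), (a + b) / 2)        # sample mean vs (a+b)/2
print(np.var(x), (b - a)**2 / 12)     # sample variance vs (b-a)^2/12
```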
%% Cell type:markdown id:dfb30ccb tags:
### Example: strip detectors
%% Cell type:markdown id:4b00e1a6 tags:
<img src="./figures/10/cms_strip.jpg" width="80.0%"
alt="image" />
%% Cell type:markdown id:ddd2a24f tags:
<img src="./figures/10/strip.png" width="80.0%"
alt="image" />
%% Cell type:markdown id:2acc164e-255b-43f4-93ee-d569d0bb9e50 tags:
one measurement: signal in strip $i$; $p(x)$ is uniformly distributed between
$a_i$ and $b_i$
How is a combination of measurements distributed?
%% Cell type:markdown id:e4d5dad2-983e-4f8f-b305-fb0a787282d6 tags:
### Central limit theorem
Central limit theorem: the probability density of the sum
$\sum_{i=1}^{n} x_i$ of a sample of $n$ independent
random variables $x_i$ with an arbitrary probability density
with mean $<x>$ and variance $\sigma^2$ approaches, in the limit
$n \to \infty$, a Gaussian probability density with mean
$\mu = n \cdot <x>$ and variance $V = n \cdot \sigma^2$.
<embed src="./figures/08/Summe-von-Gleichverteilungen3.pdf"
style="width:60.0%" />
%% Cell type:markdown id:fb2a4c3c tags:
## What is meant with error/uncertainty on a measured quantity?
%% Cell type:markdown id:ffd68924 tags:
If we quote $a = 1 \pm 0.5$, we usually mean that the probability for the *true* value of $a$ is Gaussian $G(a, \mu, \sigma)$ distributed with $\mu = 1$ and $\sigma = 0.5$.
%% Cell type:markdown id:f8037279 tags:
### Exercise: How often can/should the measurement be outside one sigma?
Let's use pseudo-experiments/Monte Carlo:
* generate 10,000 Gaussian distributed measurements
* count how often they differ by more than one sigma
Relatively easy with *scipy* and *numpy*:
* use [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html)
* use [scipy.stats.norm](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) class
%% Cell type:code id:9b4557d9 tags:
``` python
```
%% Cell type:markdown id:a92fa7d1 tags:
``` python
import scipy.stats as stats
import numpy as np
pseudo_a = stats.norm.rvs(1, 0.5, 10000)
print(pseudo_a)
is_outside = abs(pseudo_a - 1) > 0.5
print(is_outside)
print("fraction outside one sigma:", sum(is_outside)/len(pseudo_a))
```
%% Cell type:markdown id:54d58609 tags:
### Why is it a Gaussian?
Central limit theorem:
"let $X_{1},X_{2},\dots ,X_{n}$ denote a statistical sample of size $n$ from a population with expected value (average) $\mu$ and finite positive variance $\sigma ^{2}$, and let $\bar {X_{n}}$ denote the sample mean (which is itself a random variable). Then the limit as $n\to \infty$ of the distribution of $\frac {({\bar {X}}_{n}-\mu )}{\frac {\sigma }{\sqrt {n}}}$, is a normal distribution with mean 0 and variance 1."
%% Cell type:markdown id:977b88f5-3cb7-445b-adac-f44be4d69c90 tags:
## Back to the Bundesliga
%% Cell type:code id:641ec6cc-6f1a-4399-b652-7e3da333782a tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("./exercises/09_data.txt")
summe = data[:,0] + data[:,1]
freq = plt.hist(summe, bins=10, range=(-0.5,9.5))
plt.xlabel("k")
print(freq)
```
%% Cell type:markdown id:e6bcf92b-386d-492b-9f09-e29118e93505 tags:
In 18 matches exactly six goals were scored.<br>
How large is the error on this 18?<br>
What is a 68.27\% confidence interval for the 6-goals bin?
%% Cell type:markdown id:8c4a1016-7fc9-4eb6-b515-b51c0d16fed8 tags:
naively: $k=18$, Poisson distribution with $\hat \mu = 18$ and $\sigma = \sqrt{\hat \mu} = \sqrt{18}$: $18\pm 4.24$
%% Cell type:markdown id:b26cd06e-7700-4011-b5d8-6bf56e22489f tags:
### Confidence:
$P(x_- \le x \le x_+) = \int_{x_-}^{x_+} P(x)\, dx$
%% Cell type:markdown id:1903f299-4d16-41b0-bb48-5f17a07a9006 tags:
### However, what is $x$? $\mu$ or $k$???
%% Cell type:markdown id:b54b11f9-e897-4d2a-a5df-961c16cdbb3f tags:
For our measurement, $k=18$ is fixed: $P(k=18) = 1$.
We need the confidence interval for $\mu$.
Hence: $P(\mu_- \le \mu \le \mu_+) = \frac{\int_{\mu_-}^{\mu_+} P(k, \mu) d\mu}{\int_{0}^{\infty} P(k, \mu) d\mu}$
%% Cell type:code id:5052bdb2-6e4b-4ea6-b08e-ba22f2a4a3d2 tags:
``` python
import scipy
mus = np.linspace(0,50,1000)
plt.plot(mus, scipy.stats.poisson.pmf(18,mus))
plt.xlabel(r"$\mu$")
scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(18,x), 0, 100)
```
%% Cell type:markdown id:d278b736-1f45-4b6c-8e1a-c851bb618aeb tags:
### Test of naive interval
%% Cell type:markdown id:71c0df91-c1e0-4fbf-a94b-fd88f839ef07 tags:
$P(\mu_- \le \mu \le\mu_+) = \int_{\mu_-}^{\mu_+} P(k, \mu) d\mu$
%% Cell type:code id:c595233d tags:
%% Cell type:code id:e2ec2ce6 tags:
``` python
```
%% Cell type:code id:7c461ddf-0cb8-49b0-adc6-a7ae6f11212b tags:
``` python
scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(18,x), 18-np.sqrt(18), 18+np.sqrt(18))
```
%% Cell type:markdown id:e34f6d9d tags:
...Ok-ish...
%% Cell type:markdown id:88f176b2-deb2-4212-b4ba-d86b7964e545 tags:
## Example from Publication:
"Search for flavor changing neutral currents in decays of top quarks" (D0)
Physics Letters B 701 (2011), pp. 313-320
[https://arxiv.org/abs/1103.4574](https://arxiv.org/abs/1103.4574)
<img src="./figures/12/ht_bq_1jets_comb_5pc.png" width="100%" />
%% Cell type:markdown id:5135b8e6-711b-440e-8507-f049d017c137 tags:
### Let's try another bin
1 match had exactly seven goals:
Naive confidence interval: $1\pm 1$
Test it:
- draw the Poisson probability $P(1,\mu)$ vs. $\mu$
- compute the confidence for the interval $[0, 2]$ numerically with [scipy.integrate.quad](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.quad.html)
%% Cell type:code id:06885d9b tags:
%% Cell type:code id:819c3f7b tags:
``` python
```
%% Cell type:code id:e6102fec-1275-439a-b6b4-8cb1ee4960fc tags:
``` python
mus = np.linspace(0,10,1000)
plt.plot(mus, scipy.stats.poisson.pmf(1,mus))
plt.xlabel(r"$\mu$");
```
%% Cell type:code id:a0595e97 tags:
``` python
scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(1,x), 1-np.sqrt(1), 1+np.sqrt(1))
```
%% Cell type:markdown id:2c24eace tags:
too small!
%% Cell type:markdown id:3139be1a-8a4d-4ae9-9fa7-aa99d7efffe6 tags:
### Let's find a better interval: $[\mu_-, \mu_+]$
here:
$\mu_- = 0$
find $\mu_+$ such that the integral over $[0,\mu_+]$ is $0.6827$:
- define a function that returns the integral for a given $\mu_+$
- plot the function
- use root finding with [scipy.optimize.brentq](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.brentq.html) to find the value of $\mu_+$ where the integral minus 0.6827 is 0.
%% Cell type:code id:2d187a99 tags:
%% Cell type:code id:7394d804 tags:
``` python
```
%% Cell type:code id:630c83e0-2be4-41dd-a124-2ac672a2c507 tags:
``` python
def intervall(mu_plus):
return scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(1,x), 0, mu_plus)[0]
mus = np.linspace(0,10,100)
plt.plot(mus, np.vectorize(intervall)(mus))
plt.grid()
plt.xlabel(r"$\mu$")
scipy.optimize.brentq(lambda x: intervall(x)-0.6827, 0,10)
```
%% Cell type:markdown id:f20e18c3 tags:
Correct interval: $[0, 2.36]$
%% Cell type:markdown id:1c4ad7b4-bbdb-4f9d-a795-0e2dae39a181 tags:
### What is now the correct interval for the six-goals bin?
Which one?
symmetric around $\mu$ or...
e.g. symmetric in the confidence:
%% Cell type:code id:5fffd76e-14da-4854-8af0-5262e0c3c9a3 tags:
``` python
def intervall_minus(mu):
return scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(18,x), mu, 18)[0]
def intervall_plus(mu):
return scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(18,x), 18, mu)[0]
print("naive:", 18-np.sqrt(18), 18+np.sqrt(18))
mu_minus = scipy.optimize.brentq(lambda x: intervall_minus(x)-0.6827/2, 0,18)
mu_plus = scipy.optimize.brentq(lambda x: intervall_plus(x)-0.6827/2, 18,40)
print("symmetric in P:", mu_minus, mu_plus)
```
%% Cell type:code id:52f8f412-cb94-4c32-95b6-03f178cbcdb6 tags:
``` python
scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(18,x), mu_minus, mu_plus)
```
%% Cell type:markdown id:f7ef04e8-1dc9-4e8b-b597-39d277bde591 tags:
### Confidence intervals for Poisson
%% Cell type:code id:e0a6084d-dd5d-4ffc-8849-a61a8327fb4e tags:
``` python
def intervall_minus(k, mu):
return scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(k,x), mu, k)[0]
def intervall_plus(k, mu):
return scipy.integrate.quad(lambda x: scipy.stats.poisson.pmf(k,x), k, mu)[0]
def conv_limits(k, c):
c_low = intervall_minus(k, 0)
c_up = c / 2
if c_low < c/2:
mu_minus = 0
else:
c_low = c / 2
mu_minus = scipy.optimize.brentq(lambda x: intervall_minus(k, x)-c_low, 0,k)
c_up = c - c_low
mu_plus = scipy.optimize.brentq(lambda x: intervall_plus(k, x)-c_up, k, 2*k + 20)
return mu_minus, mu_plus
conv_limits(3, 0.6827)
```
%% Cell type:code id:382e5a13-5f5f-44e5-90dd-7272fbdb171a tags:
``` python
ks = np.arange(0, 20)
limits = np.zeros((len(ks), 2))
for i,k in enumerate(ks):
limits[i] = conv_limits(k, 0.6827)
#print(limits)
plt.plot(ks, limits[:,0], 'b_')
plt.plot(ks, limits[:,1], 'r_')
plt.plot(ks, ks-np.sqrt(ks), 'b.')
plt.plot(ks, ks+np.sqrt(ks), 'r.')
plt.xlabel("$k$")
plt.ylabel(r"$\mu$")
plt.grid()
print(limits[2])
```
%% Cell type:markdown id:61d94d51-fc1c-4ddb-9cbc-c416a28158f8 tags:
## Confidence intervals for Gaussian distribution
%% Cell type:code id:cb5826ae-29df-41c3-ad8f-dde61b5d90ba tags:
``` python
def intervall_minus(x, mu, sigma):
return scipy.integrate.quad(lambda y: scipy.stats.norm.pdf(x,y, sigma), mu, x)[0]
def intervall_plus(x, mu, sigma):
return scipy.integrate.quad(lambda y: scipy.stats.norm.pdf(x, y, sigma), x, mu)[0]
def conv_limits(x, c, sigma):
c_low = c / 2
mu_minus = scipy.optimize.brentq(lambda mu: intervall_minus(x, mu, sigma)-c_low, x - 10*sigma,x)
c_up = c - c_low
mu_plus = scipy.optimize.brentq(lambda mu: intervall_plus(x, mu, sigma)-c_up, x, x + 10*sigma)
return mu_minus, mu_plus
conv_limits(3, 0.6827, 1)
```
%% Cell type:code id:e87da118-0c11-4345-a656-096de6f980e9 tags:
``` python
xs = np.linspace(0,20,100)
sigma = 1
limits = np.zeros((len(xs), 2))
for i,k in enumerate(xs):
limits[i] = conv_limits(k, 0.6827, sigma)
#print(limits)
plt.plot(xs, limits[:,0], 'b')
plt.plot(xs, limits[:,1], 'r')
plt.plot(xs, xs-sigma, 'b.')
plt.plot(xs, xs+sigma, 'r.')
plt.xlabel("$x$")
plt.ylabel(r"$\mu$")
plt.grid()
print(limits[2])
```
%% Cell type:code id:202144ae-1057-41fc-8757-7f6f40c75868 tags:
``` python
print(scipy.integrate.quad(lambda y: scipy.stats.norm.pdf(3, y, 1), 2, 3)[0], scipy.integrate.quad(lambda x: scipy.stats.norm.pdf(x, 3, 1), 2,3)[0])
```
%% Cell type:markdown id:24359dd4-7289-4235-872e-1138d5fe0b0b tags:
For Gaussian: $G(x, \mu, \sigma) = G(\mu, x, \sigma)$: $\int_{\mu_-}^{\mu_+} G(x, \mu, \sigma) d\mu = \int_{\mu_-}^{\mu_+} G(\mu, x, \sigma) d\mu$
%% Cell type:markdown id:10f482d3-6667-4250-aefa-5df896e46936 tags:
## Confidence regions
Confidence values for the Gaussian distribution and intervals $[\mu - z\sigma, \mu + z\sigma]$:
%% Cell type:code id:580b83dc-3cb0-40df-bc77-8ad48d90fce5 tags:
``` python
def conv_gaus(z):
return scipy.stats.norm.cdf(z) - scipy.stats.norm.cdf(-z)
for z in [1,2,3,4,5]:
print(z, conv_gaus(z))
```
%% Cell type:markdown id:3904ad68-2b4d-4207-a7e8-35def0372238 tags:
$z$ for 68\%, 90\%, 95\% and 99\%:
%% Cell type:code id:5a71ee06-db32-4349-a3da-ae0e394fbd74 tags:
``` python
for c in [0.68, 0.90, 0.95, 0.99]:
print(c, scipy.optimize.brentq(lambda z: conv_gaus(z)-c,0, 10))
```
%% Cell type:markdown id:2d5f1dab-cebd-45fe-934e-36cd06ba2782 tags:
### Regions in two or three dimensions:
$\vec x^T = (x_1, x_2, \dots, x_n)$
Let $x_i$ be independent random variables and Gaussian distributed
%% Cell type:code id:73c11f8d-c8e4-4dc7-ad37-4ca4a33c55ea tags:
``` python
def conv_gaus_nd(z, n):
return np.power(scipy.stats.norm.cdf(z) - scipy.stats.norm.cdf(-z),n)
for z in [1,2,3,4,5]:
print(z, conv_gaus_nd(z, 1), conv_gaus_nd(z, 2), conv_gaus_nd(z, 3))
```
%% Cell type:markdown id:8dd1036c-edf9-40c4-be16-56025f49134b tags:
### Regions in two or three dimensions:
$z$ for 68\%, 90\%, 95\% and 99\%:
%% Cell type:code id:7f14d38e-1e41-460e-9876-2afb62bed08b tags:
``` python
for c in [0.68, 0.90, 0.95, 0.99]:
print(c, scipy.optimize.brentq(lambda z: conv_gaus_nd(z, 1)-c,0, 10), scipy.optimize.brentq(lambda z: conv_gaus_nd(z, 2)-c,0, 10), scipy.optimize.brentq(lambda z: conv_gaus_nd(z,3)-c,0, 10))
```
%% Cell type:markdown id:9aca662e-bdd6-408e-8f28-852614e51577 tags:
1,2,3-$\sigma$-equivalents in two or three dimensions:
%% Cell type:code id:702daac8-81e8-406c-b70f-c3146c58a493 tags:
``` python
for k in [1, 2, 3]:
c = conv_gaus_nd(k, 1)
print(c, scipy.optimize.brentq(lambda z: conv_gaus_nd(z, 2)-c,0, 10), scipy.optimize.brentq(lambda z: conv_gaus_nd(z,3)-c,0, 10))
```
%% Cell type:markdown id:bba9c7fc-9e11-4441-90e0-f1d186a1726e tags:
# Hypothesis tests
**Bundesliga**:
Hypothesis: "The $k_i$ goals in each Bundesliga match $i$ are Poisson distributed with a common parameter $\mu = <k>$."
We need an alternative hypothesis for the test: "The goals $k_i$ in each Bundesliga match are Poisson distributed with parameter $\mu_i = k_i$ for each match."
Errors of first and second kind:
* error of first kind: The hypothesis is true, but is rejected.
significance: $\alpha$
specificity: $1-\alpha$ (efficiency)
* error of the second kind: The hypothesis is accepted, but is wrong (false negative).
probability for error: $\beta$
power: $1-\beta$
%% Cell type:markdown id:644cb69d tags:
### Example
Gaussian distributed random variable $x$ ($\sigma = 1$)
Hypothesis: $\mu = 0$
%% Cell type:markdown id:4e67fff2-211b-4eef-95e2-44e3b363098c tags:
For which region $x_- < x < x_+$ shall we accept the hypothesis for an error of the first kind of $\alpha = 5\%$?
%% Cell type:code id:85e25147-c902-4434-b3dc-908486408c5b tags:
``` python
x = np.linspace(-5,5,100)
xout1 = np.linspace(-5, -1.9599639845401602)
xout2 = np.linspace(1.9599639845401602, 5)
plt.plot(x, scipy.stats.norm.pdf(x))
plt.fill_between(xout1, scipy.stats.norm.pdf(xout1), np.zeros_like(xout1),color="c")
plt.fill_between(xout2, scipy.stats.norm.pdf(xout2), np.zeros_like(xout2),color="c")
plt.xlabel("$x$")
plt.grid()
```
%% Cell type:markdown id:6cbb0d56-981d-4dfa-9a7a-0d6d4a51a357 tags:
### What is the error of the second kind?
%% Cell type:markdown id:cb79afd6-ba74-448b-9096-05cedeb842a6 tags:
Needs an alternative hypothesis!
Example: true $\mu = 3$:
%% Cell type:code id:603a31e1-1425-4eb6-9b12-c44ba31d06a6 tags:
``` python
x = np.linspace(-5,8,200)
xin = np.linspace(-1.9599639845401602, 1.9599639845401602)
xout1 = np.linspace(-5, -1.9599639845401602)
xout2 = np.linspace(1.9599639845401602, 5)
plt.plot(x, scipy.stats.norm.pdf(x))
plt.fill_between(xout1, scipy.stats.norm.pdf(xout1), np.zeros_like(xout1),color="c")
plt.fill_between(xout2, scipy.stats.norm.pdf(xout2), np.zeros_like(xout2),color="c")
plt.plot(x, scipy.stats.norm.pdf(x, 3))
plt.fill_between(xin, scipy.stats.norm.pdf(xin, 3), np.zeros_like(xin),color="orange")
plt.xlabel("$x$")
plt.grid()
print("error of second kind: beta:", scipy.stats.norm.cdf(1.9599639845401602, 3) - scipy.stats.norm.cdf(-1.9599639845401602, 3))
```
%% Cell type:markdown id:f6a454b7-fce4-4dca-ad3c-9e220956706b tags:
## Neyman-Pearson Lemma
Hypothesis: $H$
Alternative hypothesis: $A$
Goal: a criterion for an acceptance region giving the highest power (signal purity).
Assume $H$ is accepted for $x$ in range $V$:
$\int_{V} P_{H}(x) dx = 1 - \alpha$ (large)
$\int_{V} P_{A}(x) dx = \beta$ (small)
$V$ contains the values of $x$ for which $\frac{P_{H}(x)}{P_{A}(x)}$ is large.
Neyman-Pearson lemma: best range selected with $\frac{P_{H}(x)}{P_{A}(x)} > c$ (likelihood ratio)
%% Cell type:markdown id:b636ff7e-5ba1-4fc5-bb40-fb29b63ca88c tags:
### Finally: Does our model for the Bundesliga work?
**Bundesliga**:
Hypothesis: "The $k_i$ goals in each Bundesliga match $i$ are Poisson distributed with a common parameter $\mu = <k>$."
Alternative hypothesis: "The goals in each Bundesliga match $k_i$ are Poisson distributed with parameter $\mu_i = ki$ for each match."
Neyman-Pearson: likelihood ratio: $\frac{P_{A}(x)}{P_{H}(x)} = \frac{\hat L(\vec k; \vec k)}{L(\mu; \vec k)} > c$
We look at the $log$ of the likelihood ratio for our model $-lnL(\mu; \vec k)$ and the alternative (saturated model) $-ln\hat L(\vec k; \vec k)$.
$P_{H} (\vec k) = \prod_{i} P(k_i,\mu)$; $P_{A} (\vec k) = \prod_{i} P(k_i, k_i)$
$d = \ln \frac{P_{A}(\vec k)}{P_{H}(\vec k)} = -\ln \frac{P_{H}(\vec k)}{P_{A}(\vec k)} = -\ln L(\mu; \vec k)-(-\ln\hat L(\vec k; \vec k))$
%% Cell type:code id:75234801-341f-4474-a2bf-93aafe9add77 tags:
``` python
mu = np.mean(summe)
print("mu:", mu)
print("P(H):", np.prod(scipy.stats.poisson.pmf(summe,mu)))
print("P(A):", np.prod(scipy.stats.poisson.pmf(summe,summe)))
print("-ln P(H):", -np.sum(scipy.stats.poisson.logpmf(summe,mu)))
print("-ln P(A):", -np.sum(scipy.stats.poisson.logpmf(summe,summe)))
print("d:", -np.sum(scipy.stats.poisson.logpmf(summe,mu)) + np.sum(scipy.stats.poisson.logpmf(summe,summe)))
```
%% Cell type:markdown id:e840a509-4f5b-48ef-9594-eb1dc0bbdc4a tags:
### What does this mean?
Which distribution of our test statistic $d$ do we expect for pseudo-data from our model and $\mu=<k>$?
%% Cell type:code id:c2d1227f tags:
%% Cell type:code id:e8dfbb1f tags:
``` python
# simulate 1000 seasons
muobs = np.mean(summe)
tore = scipy.stats.poisson.rvs(muobs,size=(1000,306))
d = np.zeros(len(tore))
mu = np.zeros(len(tore))
for i in range(len(tore)):
mu[i] = np.mean(tore[i,:])
d[i] = np.sum(-scipy.stats.poisson.logpmf(tore[i],mu[i]) + scipy.stats.poisson.logpmf(tore[i],tore[i]))
plt.hist(mu, bins=50)
muobs = np.mean(summe)
plt.plot([muobs, muobs], [0, 100], linestyle = 'dotted')
plt.grid()
plt.xlabel("$\hat \mu$");
```
%% Cell type:code id:90e27f87-36fd-4b11-add1-c82a4eeb79dd tags:
``` python
dobs = -np.sum(scipy.stats.poisson.logpmf(summe,muobs)) + np.sum(scipy.stats.poisson.logpmf(summe,summe))
plt.hist(d, bins=50)
plt.plot([dobs, dobs], [0, 100], linestyle = 'dotted')
plt.grid()
plt.xlabel("$-\ln(P(H)/P(A))$")
plt.show()
```
%% Cell type:markdown id:f6db4bd4-de67-48e1-add9-4fdc05dbbcf7 tags:
### The $p$-value
Fraction of expected outcomes that have a higher value than we get for data:
%% Cell type:code id:18f65b9d tags:
%% Cell type:code id:2be2516a tags:
``` python
plt.hist(mu, bins=50)
plt.plot([muobs, muobs], [0, 100], linestyle = 'dotted')
plt.grid()
plt.xlabel("$\hat \mu$")
plt.show()
```
%% Cell type:code id:b82090eb tags:
%% Cell type:code id:a39e2f6f tags:
``` python
dobs = -np.sum(scipy.stats.poisson.logpmf(summe,muobs)) + np.sum(scipy.stats.poisson.logpmf(summe,summe))
plt.hist(d, bins=50)
plt.plot([dobs, dobs], [0, 100], linestyle = 'dotted')
plt.grid()
plt.xlabel("$-\ln(P(H)/P(A))$")
plt.show()
```
%% Cell type:code id:6647f14e-2282-4080-9139-c332db4af23a tags:
``` python
print("p:", np.sum(d > dobs)/len(d))
```
%% Cell type:markdown id:c835358c-1291-431c-b0ab-792a4b940d68 tags:
### Can we do this analytically, without Monte Carlo?
Simple case: Gaussian distributions:
$$d = -\ln L(\mu; \vec x)-(-\ln\hat L(\vec x; \vec x)) = -\ln \prod_{i} \frac{G(x_i,\mu, \sigma)}{G(x_i, x_i, \sigma)}=-\ln \prod_{i} \exp\left(-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2\right) = \frac{1}{2} \sum_i\left(\frac{x_i - \mu}{\sigma}\right)^2 = \frac{1}{2}\chi^2$$
For a Gaussian, $-2 \ln\frac{P(H)}{P(A)}$ follows a $\chi^2$ distribution.
Wilks' theorem:
$-2 \ln\frac{P(H)}{P(A)}$ approaches a $\chi^2$ distribution (asymptotic limit) with $n$ degrees of freedom, $n = \text{number of data points} - \text{number of model parameters}$
%% Cell type:markdown id:f09138b6 tags:
%% Cell type:markdown id:fb443d99 tags:
### Let's try it
- use the $\chi^2$ p.d.f. and c.d.f. from [scipy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html)
- what is the value of `df`?
- draw the $\chi^2$ distribution in the interval $[200, 425]$
- compute the $p$-value for our data
- does it work? Add the histogram of the simulated $\chi^2=2d$ values.
%% Cell type:code id:e609ea07 tags:
%% Cell type:code id:8f17b01a tags:
``` python
```
%% Cell type:code id:9d667a3b-6a89-419a-8be4-79cb7de96151 tags:
``` python
plt.hist(2*d, bins=50, density=True)
plt.plot([2*dobs, 2*dobs], [0, 0.02], linestyle = 'dotted')
ds = np.linspace(200, 425, 100)
plt.plot(ds,scipy.stats.chi2.pdf(ds, 305))
print("p-Wert via Chi2:", scipy.stats.chi2.sf(2*dobs, 305))
plt.grid()
plt.xlabel("$-2\ln(P(H)/P(A))$")
plt.show()
```
%% Cell type:markdown id:9f258f33 tags:
%% Cell type:markdown id:72272998 tags:
### How are the $p$-values from the simulation, computed using Wilks' theorem, distributed?
%% Cell type:code id:cda0957a-414f-4bea-ab33-e61218a4a4f5 tags:
``` python
plt.hist( scipy.stats.chi2.sf(2*d, 305), bins=50, density=True)
plt.plot([ scipy.stats.chi2.sf(2*dobs, 305), scipy.stats.chi2.sf(2*dobs, 305)], [0, 12], linestyle = 'dotted')
plt.grid()
plt.xlabel("p")
plt.show()
```
%% Cell type:markdown id:9378e3ba tags:
%% Cell type:markdown id:2f72dcc4 tags:
### What distribution should we get if it works?
%% Cell type:markdown id:59be1788 tags:
%% Cell type:markdown id:ac6f96a5 tags:
For the simulation:
%% Cell type:markdown id:5d172ac9 tags:
%% Cell type:markdown id:ec6272b7 tags:
For the $\chi^2$ distribution:
%% Cell type:code id:a096afa1 tags:
%% Cell type:code id:15bf969e tags:
``` python
pvalues = [np.sum(d > dsim)/len(d) for dsim in d]
plt.hist(pvalues, bins=50, density=True)
plt.grid()
plt.show()
```
%% Cell type:code id:648b37eb tags:
%% Cell type:code id:b674636f tags:
``` python
d2 = scipy.stats.chi2.rvs(305, size=10000)
pvalues = [np.sum(d2 > d2sim)/len(d2) for d2sim in d2]
plt.hist(pvalues, bins=50, density=True)
plt.grid()
plt.show()
```
%% Cell type:markdown id:557559e1-8e4a-412a-9486-a2fb9b97de90 tags:
### Does it work better for large $\mu$ (handball)?
- simulate 1000 seasons for $\mu_\text{obs} = 35$
- and rerun the tests
%% Cell type:code id:770f4472 tags:
%% Cell type:code id:82ea4c91 tags:
``` python
```
%% Cell type:code id:472aac93-e2dd-424a-ba0e-51c10d409f8e tags:
``` python
# simulate 5000 seasons
muobs = 35
tore = scipy.stats.poisson.rvs(muobs,size=(5000,306))
d = np.zeros(len(tore))
mu = np.zeros(len(tore))
for i in range(len(tore)):
mu[i] = np.mean(tore[i,:])
d[i] = np.sum(-scipy.stats.poisson.logpmf(tore[i],mu[i]) + scipy.stats.poisson.logpmf(tore[i],tore[i]))
plt.hist(mu, bins=50)
plt.grid()
plt.xlabel("$\hat \mu$")
plt.show()
plt.hist(2 * d, bins=50, density=True)
ds = np.linspace(200, 425, 100)
plt.plot(ds,scipy.stats.chi2.pdf(ds, 305))
plt.grid()
plt.xlabel("$-2\ln(P(H)/P(A))$")
plt.show()
plt.hist( scipy.stats.chi2.sf(2*d, 305), bins=50)
plt.grid()
plt.xlabel("p")
plt.show()
```
%% Cell type:markdown id:450edccb-2c13-4400-9613-4a1f3e069ce6 tags:
### Pearson's $\chi^2$-Test
What are the confidence regions for $n$ degrees of freedom/dimensions:
data: $\vec y = y_1,\dots, y_n$
predicted values from model: $\vec{f} = f_1,\dots,f_n$
$$\chi^2 = (\vec{y} - \vec{f})^{T} V^{-1}(\vec{y}- \vec{f})$$
%% Cell type:code id:f703d4ed tags:
%% Cell type:code id:335523f9 tags:
``` python
def conv_chi2_nd(z, n):
return scipy.stats.chi2.cdf(z,n)
for z in [1,2,3,4,5]:
print(z, conv_chi2_nd(z, 1), conv_chi2_nd(z, 2), conv_chi2_nd(z, 3))
```
%% Cell type:code id:2fd5c302-732b-4c6b-b224-5b9f79df57cc tags:
``` python
print("critical chi2 values")
for k in [1, 2, 3]:
c = conv_gaus_nd(k, 1)
print(c, scipy.optimize.brentq(lambda z: conv_chi2_nd(z, 1)-c,0, 30), scipy.optimize.brentq(lambda z: conv_chi2_nd(z, 2)-c,0, 30), scipy.optimize.brentq(lambda z: conv_chi2_nd(z,3)-c,0, 30))
```
%% Cell type:code id:23626cec-4198-44f4-a1c8-81e36992268f tags:
``` python
for c in [0.68, 0.90, 0.95, 0.99]:
print(c, scipy.optimize.brentq(lambda z: conv_chi2_nd(z, 1)-c,0, 30), scipy.optimize.brentq(lambda z: conv_chi2_nd(z, 2)-c,0, 30),scipy.optimize.brentq(lambda z: conv_chi2_nd(z,3)-c,0, 30))
```
%% Cell type:markdown id:69556eab-a67e-4251-a4b3-0290013aa344 tags:
# Estimators
## Error propagation
### Error propagation: example I
Resistance measurement: $U = 24$ V and $I = 0.6 \pm 0.1$ A
Assumption: $I$ Gaussian distributed $g(I)$ with $\mu_I = 0.6$ and
$\sigma_I = 0.1$
What is $p(R)$?
$R = U/I$, $I = U/R$ and $|dI/dR| = U/R^2$
$$\begin{aligned}
p(R) & = & U/R^2\, g(I(R)) = \frac{U}{R^2 \sqrt{2\pi}\sigma_I}e^{-\frac{(U/R-\mu_I)^2}{2\sigma_I^2}} \\
& = & \frac{U}{R^2 \sqrt{2\pi}\sigma_I}e^{-\frac{(R-U/\mu_I)^2}{2\sigma_I^2 R^2/\mu_I^2}} \\
\end{aligned}$$ (show in Jupyter)
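A minimal sketch of the plot suggested above, comparing the exact $p(R)$ with the Gaussian from linear error propagation ($\sigma_R = (U/\mu_I^2)\,\sigma_I$ follows from the linearization):

``` python
# Plot p(R) for U = 24 V, mu_I = 0.6 A, sigma_I = 0.1 A
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

U, mu_I, sigma_I = 24.0, 0.6, 0.1
R = np.linspace(20, 80, 500)
p_R = U / R**2 * scipy.stats.norm.pdf(U / R, mu_I, sigma_I)

R0 = U / mu_I                        # 40 Ohm
sigma_R = U / mu_I**2 * sigma_I      # linearized width, ~6.7 Ohm
plt.plot(R, p_R, label="exact transformation")
plt.plot(R, scipy.stats.norm.pdf(R, R0, sigma_R), "--", label="linear approximation")
plt.xlabel(r"$R$ [$\Omega$]")
plt.legend()
plt.show()
```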
%% Cell type:markdown id:06ff9896-15bc-4404-a789-bbe7557350eb tags:
### Error propagation: example II
Voltage measurement: $R = 40$ $\Omega$ and $I = 0.6 \pm 0.1$ A
Assumption: $I$ Gaussian distributed $g(I)$ with $\mu_I = 0.6$ and
$\sigma_I = 0.1$
What is $p(U)$?
$U = RI$, $I = U/R$ and $|dI/dU| = 1/R$
$$\begin{aligned}
p(U) & = & \frac{1}{R}\, g(I(U)) = \frac{1}{R \sqrt{2\pi}\sigma_I}e^{-\frac{(U/R-\mu_I)^2}{2\sigma_I^2}} \\
& = & \frac{1}{\sqrt{2\pi}R\sigma_I}e^{-\frac{(U-R\mu_I)^2}{2R^2\sigma_I^2}} \\
& = & g(U) \text{ with $\mu_U = R\mu_I$ and $\sigma_U = R\sigma_I$}\\
\end{aligned}$$
%% Cell type:markdown id:0fe93c64-f06b-49ec-addb-22b12cf06ec6 tags:
### Error propagation: general case
Gaussian error propagation: assumption: $x$ Gaussian distributed $g(x)$ with
$\mu_x$ and $\sigma_x$
What is $p(y(x))$?
$y(x) \approx y(\mu_x) + \frac{dy}{dx}\big|_{x=\mu_x}(x-\mu_x)$
$x - \mu_x \approx (y-y(\mu_x))(\frac{dy}{dx}\big|_{\mu_x})^{-1}$,
$|dx/dy| = |\frac{dy}{dx}(\mu_x)|^{-1}$ $$\begin{aligned}
p(y) & \approx & |\frac{dy}{dx}(\mu_x)|^{-1} g(x(y)) = \frac{1}{\frac{dy}{dx}\big|_{\mu_x}\sqrt{2\pi}\sigma_x}e^{-\frac{(x(y)-\mu_x)^2}{2\sigma_x^2}} \\
& = & \frac{1}{\frac{dy}{dx}\big|_{\mu_x}\sqrt{2\pi}\sigma_x}e^{-\frac{(y-y(\mu_x))^2}{2(\frac{dy}{dx}\big|_{\mu_x})^2\sigma_x^2}}\\
& = & g(y) \text{ with $\mu_y = y(\mu_x)$ and $\sigma_y = \frac{dy}{dx}\Big|_{\mu_x}\sigma_x$}\\
\end{aligned}$$
%% Cell type:markdown id:85f35e43-f5cd-4ccf-acd2-cbb5e963ce00 tags:
### Error propagation: several dimensions
Gaussian error propagation: assumption: $\vec x$ Gaussian distributed $g(\vec x)$
with mean $\vec \mu$ and covariance $C$
What is $p(\vec y(\vec x))$?
$y_i(\vec x) \approx y_i(\vec \mu) + \frac{dy_i}{dx_j}\big|_{\vec \mu} (x_j- \mu_j)$;
$$\begin{aligned}
E[y_i y_j] \approx & y_i(\vec \mu)y_j(\vec \mu) + y_i(\vec \mu)\frac{dy_j}{dx_k}\big|_{\vec \mu} E[x_k - \mu_k] + y_j(\vec \mu)\frac{dy_i}{dx_k}\big|_{\vec \mu}E[x_k- \mu_k] \\
& + \frac{dy_i}{dx_k}\big|_{\vec \mu}\frac{dy_j}{dx_l}\big|_{\vec \mu}E[(x_k- \mu_k) (x_l - \mu_l)]\\
= & y_i(\vec \mu)y_j(\vec \mu) + \frac{dy_i}{dx_k}\big|_{\vec \mu}\frac{dy_j}{dx_l}\big|_{\vec \mu}C_{kl} \text{ and }\text{cov}[y_i, y_j] = E[y_iy_j] - E[y_i]E[y_j]
\end{aligned}$$ With $A_{ij} = \frac{dy_i}{dx_j}\big|_{\vec \mu}$,
$\vec y(\vec x)$ is Gaussian distributed with
$$\vec \mu_y = \vec y(\vec \mu_x) \text{ and covariance } C_{yy} = A C_{xx} A^T$$
%% Cell type:markdown id:8507d724-f9f8-4d41-a3b0-3ddaf0846dde tags:
## Samples
### Estimator: mean from a sample
Example: estimate $I$ from three measurements of the current $I_1$, $I_2$,
$I_3$ with equal $\sigma_I = 0.1$ A.
Assumption: the measurements are Gaussian distributed around $I$.
Estimator for $\mu$: the mean: $$\hat I = 1/3(I_1+I_2+I_3)$$ Error for
uncorrelated measurements (no systematic errors):
$$C = \left(
\begin{array}{rrr}
\sigma_I^2 &0 & 0 \\
0 & \sigma_I^2 & 0\\
0 & 0 & \sigma_I^2
\end{array} \right) \text{ and } A = \left( \begin{array}{rrr}
d\hat I/dI_1 & d\hat I/dI_2 & d\hat I/dI_3
\end{array} \right)$$
%% Cell type:markdown id:6463365e-354d-4e98-8f63-6903e2dbd6e6 tags:
### Example: error propagation
Error for independent measurements (no systematic errors):
$$C = \left(
\begin{array}{rrr}
\sigma_I^2 &0 & 0 \\
0 & \sigma_I^2 & 0\\
0 & 0 & \sigma_I^2
\end{array} \right) \text{ and } A = \left( \begin{array}{rrr}
1/3 & 1/3 & 1/3
\end{array} \right)$$ Error propagation: $$\begin{aligned}
C_{\hat I} & = & ACA^T = \left(
\begin{array}{rrr}
1/3 & 1/3 & 1/3
\end{array}
\right)
\left(
\begin{array}{rrr}
\sigma_I^2 &0 & 0 \\
0 & \sigma_I^2 & 0\\
0 & 0 & \sigma_I^2
\end{array} \right)
\left( \begin{array}{r}
1/3 \\ 1/3 \\ 1/3
\end{array}
\right)\\
& = & \left(
\begin{array}{rrr}
\sigma_I^2/3 & \sigma_I^2/3 & \sigma_I^2/3
\end{array}
\right)
\left( \begin{array}{r}
1/3 \\ 1/3 \\ 1/3
\end{array}
\right) = \sigma_I^2/9 + \sigma_I^2/9 + \sigma_I^2/9 = \sigma_I^2/3 \\
\hat \sigma_I & = & \sigma_I/\sqrt{3}
\end{aligned}$$
%% Cell type:markdown id:a038c48f-7143-480d-b0ba-817f0e5211c1 tags:
### Estimators
Estimator $\hat a$ for $a$ from a sample $x_1,\dots, x_n$
Requirements:
- unbiased:
$E[\hat a]= a$
- consistent:
$\lim_{n\to \infty } \hat a = a$
- efficient: $V[\hat a]$ as small as possible
Task: estimate the mean $\mu$ and variance $\sigma^2$ of a
probability density function $p(x)$ from a sample
$x_1,\dots,x_n$
%% Cell type:markdown id:d32672c8-da0b-468b-bf73-df40aef7245a tags:
### Estimator for the mean
Estimator for the mean $\mu$:
$$\hat \mu = \bar x = \frac{1}{n}\sum_{i=1}^n x_i$$
Tests:
- unbiased:
$$E[\hat \mu]= E[ \frac{1}{n}\sum_{i=1}^n x_i ] = \frac{1}{n} \sum_{i=1}^n E[x_i] = \frac{1}{n} \sum_{i=1}^n \mu = \mu$$
- consistent:
$$\lim_{n\to \infty } \hat \mu = \lim_{n\to \infty } \frac{1}{n}\sum_{i=1}^n x_i = \int x p(x)\,dx = \mu$$
%% Cell type:markdown id:4021d6cd-150e-4a6d-bdc4-ed049fd983d3 tags:
### Estimator for the mean
Estimator for the mean $\mu$:
$$\hat \mu = \bar x = \frac{1}{n}\sum_{i=1}^n x_i$$
Test:
- efficient: $V[\hat \mu]$ as small as possible
$$V[\hat \mu] = V[ \frac{1}{n}\sum_{i=1}^n x_i ] = \frac{1}{n^2} V[ \sum_{i=1}^n x_i ] = \frac{1}{n^2} \sum_{i=1}^n\sum_{j=1}^n \text{Cov}(x_i, x_j)$$
For independent $x_i$ the cross terms vanish:
$$V[\hat \mu] = \frac{1}{n^2} \sum_{i=1}^n \text{Cov}(x_i, x_i)= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
(or via the central limit theorem)
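A quick pseudo-experiment check of $V[\hat\mu] = \sigma^2/n$ (a minimal sketch with standard normal samples, $\sigma = 1$, and an example $n = 10$):

``` python
# Pseudo-experiment check of V[mu_hat] = sigma^2/n
import numpy as np

rng = np.random.default_rng(7)
n = 10
means = rng.normal(0.0, 1.0, size=(100000, n)).mean(axis=1)
print(np.var(means), 1.0 / n)   # both close to 0.1
```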
%% Cell type:markdown id:4dfd3266-a5a2-4db7-a96b-ef21118bc85e tags:
### Estimator for the variance
Estimator for the variance $\sigma^2$:
$$\hat \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$$
Tests:
- unbiased:
$$E[\hat {\sigma^2}]= E[ \frac{1}{n}\sum_{i=1}^n (x_i^2 - \bar x^2)] = \frac{1}{n} \sum_{i=1}^n E[x_i^2-\bar x^2] = E[x^2] - E[\bar x^2]$$
$$E[\hat{\sigma^2}]= E[x^2] - E[x]^2 + E[x]^2 - E[\bar x^2] = E[x^2] - E[x]^2 - (E[\bar x^2] - E[\bar x]^2)$$
$$E[\hat {\sigma^2}] = V[x] - V[\bar x] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2 \ne \sigma^2$$
The estimator is therefore biased; dividing by $n-1$ instead of $n$ removes the bias.
- consistent:
$\lim_{n\to \infty } \hat {\sigma^2} = \lim_{n\to \infty } \frac{n-1}{n}\sigma^2 =\sigma^2$
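A minimal sketch of the bias; in numpy, `np.var(..., ddof=1)` uses the $1/(n-1)$ normalization:

``` python
# Bias of the 1/n variance estimator; ddof=1 gives the unbiased 1/(n-1) version
import numpy as np

rng = np.random.default_rng(3)
n = 5
samples = rng.normal(0.0, 1.0, size=(200000, n))   # true sigma^2 = 1
print(np.mean(np.var(samples, axis=1)))            # ~ (n-1)/n = 0.8
print(np.mean(np.var(samples, axis=1, ddof=1)))    # ~ 1.0
```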
%% Cell type:markdown id:1621af0b-8216-4959-9cc9-8c41cdb8cc79 tags:
### Is the model correct?
The estimator that is applied always rests on assumptions about the
analyzed sample.
The error on the estimate says nothing about whether these assumptions
are correct.
%% Cell type:markdown id:8fdd0fc7-af9c-492c-9c7f-5a195ee58c50 tags:
### The current again
Current from two measurement series, $\sigma_I = 0.1$ A:
Series 1: $[0.64441099,\ 0.63895571,\ 0.73984042,\ 0.62145699,\ 0.66971489]$
Series 2: $[0.93179978,\ 0.46326547,\ 0.41350898,\ 0.12281948,\ 0.61426579]$
Estimators: $I_1 = 0.66 \pm 0.04$, $I_2 = 0.51 \pm 0.04$
Do the data match the expectation?
Residual: $$R_i = \frac{x_i - \mu}{\sigma}$$ is normally distributed if
$\mu$ and $\sigma$ are correct.
(Consider the sum of squared residuals $\sum R_i^2$.)
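A minimal sketch computing $\sum R_i^2$ for the two series quoted above ($n-1$ degrees of freedom, since $\mu$ is estimated from the same sample):

``` python
# Sum of squared residuals for the two measurement series, sigma_I = 0.1 A
import numpy as np
import scipy.stats

sigma = 0.1
series1 = np.array([0.64441099, 0.63895571, 0.73984042, 0.62145699, 0.66971489])
series2 = np.array([0.93179978, 0.46326547, 0.41350898, 0.12281948, 0.61426579])

for x in (series1, series2):
    mu_hat = np.mean(x)
    chi2 = np.sum(((x - mu_hat) / sigma)**2)
    p = scipy.stats.chi2.sf(chi2, len(x) - 1)
    print(f"I = {mu_hat:.2f}, chi2 = {chi2:.1f}, p = {p:.3f}")
```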
%% Cell type:markdown id:0c34b270-4e04-40be-82b1-f305eadf82a9 tags:
### $\chi^2$ distribution
$$\chi^2 = \sum_i \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2$$
$p(\chi^2, n)$ is the probability density of the sum of the squares of $n$
normally distributed random numbers.
Mean: $<\chi^2> = n$, the number of degrees of freedom.
<embed src="./figures/10/chi2.pdf" />
$p$-value (probability for a larger $\chi^2$):
$$p = 1- \int_0^{\chi^2} p(\chi^{\prime 2}, n)\, d\chi^{\prime 2}$$
%% Cell type:markdown id:3dbe4a35-771e-4129-9529-76f3d38fddae tags:
## Summary and outlook
Summary
- probability density functions
- error propagation (linear approximation)
- do not neglect correlations
- estimators
- $p$-value
- Literature:
    - Glen Cowan, Statistical Data Analysis,
      [pdf](https://www.sherrytowers.com/cowan_statistical_data_analysis.pdf)
    - Roger John Barlow, Statistics: A Guide to the Use of Statistical
      Methods in the Physical Sciences,
      [notes](https://arxiv.org/pdf/1905.12362.pdf)
    - Volker Blobel, Erich Lohrmann, Statistische und numerische
      Methoden der Datenanalyse,
      [pdf](https://www.desy.de/~sschmitt/blobel/eBuch.pdf)