
The **Normal** (a.k.a. 'Gaussian' or 'bell curve') distribution is of major importance in statistics, and in hypothesis testing specifically. The probability density function of the standard normal distribution in scipy is written as: `norm.pdf(x) = exp(-x**2/2)/sqrt(2*pi)`
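As a quick sanity check (not part of the original text), we can evaluate the formula above by hand and compare it with scipy's own `norm.pdf()`; the two should agree at every point:

```python
import numpy as np
from scipy.stats import norm

x = np.array([-1.0, 0.0, 1.0, 2.0])
manual = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # the formula quoted above
print(norm.pdf(x))   # scipy's density values
print(manual)        # identical values; the peak at x=0 is 1/sqrt(2*pi)
```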

- To work with statistical distributions we import the **scipy.stats** package. Usually we import the whole scipy.stats package; however, importing specific distribution classes is also possible.

In [1]:

```
import scipy.stats as stats # Imports the entire scipy.stats (all distributions)
from scipy.stats import norm # imports only the normal distribution
from scipy.stats import t # imports only the Student's t distribution
from scipy.stats import f # imports only the Fisher's f distribution
```

In [2]:

```
# Get help by printing the distribution docstring as follows:
# print(stats.norm.__doc__) # or print(stats.t.__doc__) etc.
# print(stats.norm.__doc__)
```

Import **norm** and use some basic statistical functions of the distribution:

- **median** => *median()*
- **mean** => *mean()*
- **standard deviation** => *std()*
- **variance** => *var()*

**stats()** returns the **first four moments** of the distribution: mean, variance, skew and kurtosis (for a reminder of what 'moments' are in statistics read here).

In [3]:

```
from scipy.stats import norm
# get median, mean, standard deviation and var
print(norm.median(), norm.mean(), norm.std(), norm.var())
# or use stats() function
m, v, s, k = norm.stats(loc=0, scale=1, moments='mvsk')
print(m, v, s, k)
```

- But why do we get these '0's and '1's?

- If you do not specify anything else, distributions are in a standardized form using
**standard scores** (a.k.a. z scores, z values). A z score is calculated by the transformation formula: **z = (x - M)/SD**, where M is the mean and SD the standard deviation.

Thus, z scores are independent of any specific measurement units and only describe how far from M the value x lies, in terms of SD.
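The transformation above can be sketched with NumPy (the sample values here are made up purely for illustration):

```python
import numpy as np

x = np.array([52.0, 61.0, 70.0, 79.0, 88.0])  # hypothetical measurements
M = x.mean()          # sample mean
SD = x.std()          # sample standard deviation
z = (x - M) / SD      # the transformation z = (x - M)/SD
print(z)              # each value expressed as distance from M in SD units
print(z.mean(), z.std())  # z scores always have mean 0 and SD 1
```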

- Standardized also means that the **mean** and the **standard deviation (SD)** of the distribution are set to '0' and '1' respectively.
- If you want to shift and/or scale the distribution you have to pass new values for the distribution *shape parameters*: **loc** and **scale**. **loc** represents the **mean** and **scale** represents the **SD**.

In [4]:

```
# loc and scale examples
print(norm.var(loc=2, scale=4)) # prints the variance of a norm distro with M=2 and STD=4
# print the mean and variance of a norm distro with M=1.5 and SD=0.3
m, v = norm.stats(loc=1.5, scale=0.3, moments='mv')
print(m, v)
```

- When using the **norm()** constructor we essentially refer to a whole family of normal distributions with varied 'shape parameters' (mean and standard deviation).
- It is better to '**freeze**' the distribution by constructing a non-varied normal distribution with a specific M and SD.
- To do so, we need to **call the norm() constructor and pass the values of loc and scale as arguments** (that is, define M and SD).
- After this we can use the identifier (name) of the frozen distribution without continuously passing the shape parameters as arguments.

In [5]:

```
rv = norm(loc=2, scale=4) # Use 'rv' to refer to the frozen norm distro with M=2 and SD=4
m, v, s, k = rv.stats(moments='mvsk')
print( m, v, s, k)
```
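Once frozen, any method (pdf(), cdf(), etc.) can be called on `rv` directly; a short check shows the results match calling the unfrozen `norm` with `loc`/`scale` each time:

```python
from scipy.stats import norm

rv = norm(loc=2, scale=4)   # frozen distribution with M=2, SD=4
print(rv.pdf(2.0))          # density at the mean
print(rv.cdf(2.0))          # P(X <= M) = 0.5 for any normal distribution
print(rv.cdf(2.0) == norm.cdf(2.0, loc=2, scale=4))  # same result as the unfrozen call
```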

- We can get one or more random values from a distribution with the
**rvs** (random variates) method or, for the standard normal, with the **numpy.random.randn()** function.

In [6]:

```
from scipy.stats import norm
obs = norm.rvs(loc=0, scale=1, size=10)
obs
```

Out[6]:

- Similarly we can call the **numpy.random.randn()** function to get a number of random points from the standard normal distribution.
- The **describe()** function can give us some sample statistics.

In [7]:

```
import numpy as np
from scipy import stats
s = np.random.randn(100) # 100 random points from the standard normal
stats.describe(s)
```

Out[7]:

In [8]:

```
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline
# A simplistic way to do it
x = np.linspace(-5,5,100) # define a big enough x interval
nd = norm.pdf(x) # get the norm.pdf for x interval
plt.plot(x,nd) # plot it!
```

Out[8]:

- Why is it simplistic? Because we made the assumption that the [-5, 5] interval fits well to describe the normal distribution (and we were right!).
- But what if we don't really know the practical limits of the x values for a specific distribution?
- The answer is that we always know that:
  - ...for any distribution the ppf() function (the inverse of cdf()) returns the x value below which the variate falls with the given probability.
  - ...near the lower or higher limit of the value range of the distribution the cumulative probability (cdf) becomes very small (almost 0) or very large (almost 1) respectively.

Thus, the [ppf(0.0001), ppf(0.9999)] range should return x values close enough to the lower and higher ends of the x values of the distribution. So, we rewrite the code as follows:

In [9]:

```
# A better approach
rv = norm(loc=0, scale=1) # Freeze the norm distribution
x = np.linspace(rv.ppf(0.0001), rv.ppf(0.9999), 100)
y = rv.pdf(x)
plt.plot(x,y)
```

Out[9]:

- As already discussed, a common task when using any distribution is to test whether **the probability that the variate takes a value greater than an observed x** falls below a cutoff probability 'a'.
- Knowing this probability, important conclusions can be drawn regarding hypothesis testing.
- The sf() (survival function) will help in finding the answer, since sf(x) = 1 - cdf(x).

In [10]:

```
a = 0.05 # set the cutoff
rv = norm(loc=0, scale=1)
x = np.random.normal() # draw one random observation
p = rv.sf(x) # equal to, but sometimes more accurate than, '1 - rv.cdf(x)'
if p < a:
    print('Cutoff at:', x, p)
else:
    print('No cutoff', x, p)
```
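The accuracy remark in the comment above can be checked directly: deep in the upper tail, `1 - cdf(x)` suffers floating-point cancellation (cdf(x) rounds to exactly 1.0), while `sf(x)` still returns a meaningful tiny probability:

```python
from scipy.stats import norm

x = 10.0                 # a value far in the upper tail
print(norm.sf(x))        # tiny but non-zero tail probability
print(1 - norm.cdf(x))   # cancellation: cdf(10) rounds to 1.0, so this is 0.0
```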
