
- As explained, a **distribution** of a variable is *an expression of the probability that, when measuring the variable, the result will lie within a specific range of values*.
- Depending on how the variable varies, a distribution can be either:
  - **Discrete** - The variable can take only **specific countable levels of values** (it is a 'quantized' variable).
    - Example: when throwing a die, the outcome is a discrete (quantized) variable: it can take only six values. The probability distribution (for a fair die) is constant: 1/6 for each value.
    - Scipy.stats documentation on discrete statistical distributions

  - **Continuous** - The variable can take **any value ('varies continuously')** between a minimum and a maximum value. The probability distribution varies accordingly and is usually described by some **mathematical expression**.
    - Example: when measuring each individual's height in a group of people, the variable 'height' varies between a minimum and a maximum value; there are no preset levels for this variable.
    - Scipy.stats documentation on continuous statistical distributions
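
To make the discrete/continuous distinction concrete, here is a minimal sketch using scipy.stats. The die is modelled with `randint`, and the heights with a normal distribution whose mean (170 cm) and standard deviation (10 cm) are made-up numbers chosen only for illustration. A normal distribution is also unbounded, so it only approximates a variable that lies between a minimum and a maximum.

```python
import numpy as np
from scipy import stats

# Discrete case: a fair six-sided die.
# randint(low, high) covers the integers low, ..., high - 1.
die = stats.randint(1, 7)
faces = np.arange(1, 7)
die_probs = die.pmf(faces)   # probability mass of each face
print(die_probs)             # every entry equals 1/6
print(die.pmf(3.5))          # 0.0: values between the levels cannot occur

# Continuous case: heights in cm (hypothetical mean and spread).
height = stats.norm(loc=170, scale=10)
print(height.pdf(172.3))     # a density is defined for any real value

# Probabilities for a continuous variable refer to ranges of values.
p_range = height.cdf(180) - height.cdf(160)
print(p_range)               # probability of a height between 160 and 180 cm
```

Note that discrete distributions expose `pmf` (probability mass function) where continuous ones expose `pdf`.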


- Before starting to code with distributions, I explain some important concepts, implemented as **scipy.stats methods/functions**, that help in understanding how distributions work.
- We focus on *continuous univariate* ('of one variable') distributions, but much of what is presented is also valid for discrete distributions.

- *pdf*: Probability Density Function
- *cdf*: Cumulative Distribution Function
- *ppf*: Percent Point Function (inverse of cdf)
- *sf*: Survival Function (1 - cdf)
- *isf*: Inverse Survival Function (inverse of sf)

These are important methods to work with, and you should gradually become familiar with them if you are serious about computational statistics. Below we concisely explain what each of them means and how it operates. Coding examples are given in the next sections, which discuss specific distributions.

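- pdf returns the probability density of the variable x at a specific value. For a continuous variable the probability of any single exact value is zero, so the pdf is used to obtain the probability that x lies within a range of values.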

- The **area below the pdf curve** which is enclosed between two values x_{0} and x_{f} represents the probability that x lies within this specific range of values.
- For example, the greyed area on the right (in the pdf figure above) represents the probability that x is larger than 2 (lies in the interval (2, +inf)). Correspondingly, the greyed area on the left represents the probability that x is less than or equal to -2 (lies in (-inf, -2]).
- This notion is crucial in understanding hypothesis testing, and we will come back to it later when discussing the meaning of hypothesis tests.
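
The area interpretation can be checked numerically. The sketch below assumes a standard normal distribution (mean 0, standard deviation 1), which is my choice for illustration and not necessarily the distribution shown in the figure. It integrates the pdf with `scipy.integrate.quad` and compares the result with the difference of two cdf values.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

dist = stats.norm()  # assumption: standard normal distribution

# pdf returns a density (the height of the curve), not a probability.
# A narrow distribution shows this clearly: its density can exceed 1.
narrow_peak = stats.norm(scale=0.1).pdf(0.0)
print(narrow_peak)

# Area under the pdf between x0 and xf = probability of that range.
x0, xf = -1.0, 1.0
area, _ = quad(dist.pdf, x0, xf)
via_cdf = dist.cdf(xf) - dist.cdf(x0)
print(area, via_cdf)

# The two tails described in the text: x > 2 and x <= -2.
right_tail, _ = quad(dist.pdf, 2.0, np.inf)
left_tail, _ = quad(dist.pdf, -np.inf, -2.0)
print(right_tail, left_tail)
```

For a symmetric distribution such as this one the two tail areas are equal.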

- Given a value x_{0}, cdf returns the cumulative probability that x takes a value less than or equal to x_{0}, or in other words "lies in the interval (-inf, x_{0}]". Observe the cumulative p (y axis) values from 0 to 1 in the cdf graph below.
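
As a small illustration of how cdf behaves, the following sketch evaluates it at a few points and over a grid. The standard normal distribution is again an assumption made only for the example.

```python
import numpy as np
from scipy import stats

dist = stats.norm()  # assumption: standard normal distribution

# Cumulative probability P(x <= x0) for a few values of x0.
for x0 in (-2.0, 0.0, 2.0):
    print(x0, dist.cdf(x0))

# Over a wide grid the cdf rises monotonically from about 0 to about 1.
xs = np.linspace(-5, 5, 101)
ps = dist.cdf(xs)
print(ps[0], ps[-1])
```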

- ppf returns the value x of the variable that corresponds to a given cumulative probability (cdf value). Thus, given cdf(x) for some value x, ppf returns the value x itself, and therefore operates as the inverse of cdf.

- The graph on the left shows that, for a given x value, the respective cdf(x) value is input to the ppf function, and ppf returns the original x value once again.
- The graph on the right is essentially the cdf graph above with the x and y axes swapped. For a given cdf(x) (that is, a cumulative p value), ppf returns the original x value.
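
The inverse relationship between cdf and ppf can be verified with a round trip. This is a minimal sketch, assuming a standard normal distribution; the value 1.3 is arbitrary.

```python
from scipy import stats

dist = stats.norm()  # assumption: standard normal distribution

# Round trip: x -> cdf(x) -> ppf(cdf(x)) gives back x.
x = 1.3
p = dist.cdf(x)
x_back = dist.ppf(p)
print(x, p, x_back)

# ppf also answers: below which value does 97.5% of the probability lie?
q975 = dist.ppf(0.975)
print(q975)  # about 1.96 for the standard normal
```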

- sf returns the probability that the variate x takes a value greater than a specific value x_{0}.
- sf will be very useful in estimating statistical significance in hypothesis testing.

- The greyed area in the graph above represents the probability that the variable x takes a value greater than the specific value x_{0}, or in other words "lies in the interval (x_{0}, +inf)".
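
The sketch below relates sf to cdf and shows isf undoing sf. It again assumes a standard normal distribution, and the specific values (2.0 and 10.0) are chosen only for illustration.

```python
from scipy import stats

dist = stats.norm()  # assumption: standard normal distribution

# sf(x0) is the upper-tail probability P(x > x0), equal to 1 - cdf(x0).
x0 = 2.0
p_greater = dist.sf(x0)
print(p_greater, 1 - dist.cdf(x0))

# isf is the inverse of sf: it recovers x0 from the tail probability.
x_back = dist.isf(p_greater)
print(x_back)

# Far in the tail, 1 - cdf loses precision because cdf is so close to 1,
# while sf still returns a small positive number.
far = 10.0
print(dist.sf(far), 1 - dist.cdf(far))
```

This numerical behaviour is one practical reason to call sf directly for very small tail probabilities instead of computing 1 - cdf.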
