
Descriptive statistics¶

Here, we are going to explore various possibilities for computing and plotting descriptive statistical measures of our sample.
A detailed explanation of what 'descriptive statistics' is all about is available in: Trochim, William M., The Research Methods Knowledge Base, 2nd Edition.
First, we are going to import some sample data from an external file.

(A) Import data¶

The scenario: suppose we conducted an experiment involving two student groups in two different study conditions. One group studied under normal/typical conditions ('Control') and the other under the impact of the investigated independent factor (for example, a new kind of learning technology) ('Treatment').
We read data from an external file containing the scores of the students in both groups on a post-test instrument.
Data are represented in our code as a pandas DataFrame object named 'data' containing two columns: 'Control' and 'Treatment'. Data indices are simply integers [0-37].

import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv("../../data/researchdata.csv", sep=' ')   # why do we need sep? check researchdata.csv
print(data.head())   # head() returns only the first 5 rows (for fewer/more: provide a numerical argument)
print('\n..............\n')
print(data.tail())   # tail() returns only the last 5 rows (for fewer/more: provide a numerical argument)

   Control  Treatment
0       60         72
1       65         73
2       70         82
3       75         83
4       70         82

..............

    Control  Treatment
33       45         57
34       50         62
35       55         65
36       60         72
37       65         74
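The role of the sep argument can be illustrated without the original file. The following is a minimal sketch that assumes the file is space-separated (as sep=' ' suggests); the in-memory text 'raw_text' is a hypothetical stand-in for 'researchdata.csv', not its actual content.

```python
import io
import pandas as pd

# Hypothetical stand-in for a space-separated data file
raw_text = "Control Treatment\n60 72\n65 73\n70 82\n"

# With sep=' ' every line is split on the space character into two columns
parsed = pd.read_csv(io.StringIO(raw_text), sep=' ')

# With the default sep=',' no comma is found, so each line ends up in a single column
unparsed = pd.read_csv(io.StringIO(raw_text))

print(parsed.columns.tolist())
print(unparsed.columns.tolist())
```

Without the right separator pandas still reads the file, but the two groups are fused into one text column, so no numerical analysis is possible.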

# We take the 'Control' group as the case study for descriptive statistics
dc = data.Control

(B) Apply descriptive stats¶

Descriptive statistics mainly concern the way data:
1) Are distributed across a range of values
2) Tend to center around specific values (mean, median, mode), and
3) Vary and disperse around these specific values

1. Data distribution¶

Usually we are interested in the frequency distribution of the data. In our scenario we can obtain the frequency distribution by using the value_counts() Series method, which returns a Series object (sorted by descending frequency) with:
- index: the unique values in the Series
- values: the frequency of each unique value

frd = dc.value_counts()
frd

65    8
70    7
60    6
75    5
55    2
50    2
45    2
90    2
85    2
80    2
Name: Control, dtype: int64
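As a small sketch of how value_counts() behaves, the example below uses a hypothetical six-value Series ('scores' is made up for illustration, not taken from the data file). It also shows two optional variations: sorting by value instead of by frequency, and relative frequencies via normalize=True.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
scores = pd.Series([65, 70, 65, 60, 70, 65], name="Control")

counts = scores.value_counts()                    # index: unique values, values: their counts
by_value = counts.sort_index()                    # order by score instead of by frequency
relative = scores.value_counts(normalize=True)    # proportions instead of raw counts

print(counts)
print(by_value)
print(relative)
```

Sorting by value is often the more natural order for plotting, since the x axis then follows the score scale.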

Presenting the distribution as a bar plot is easy with the matplotlib bar() function:

# each cyan bar is centered on its unique value (align='center'), so no manual x offset is needed
bars = plt.bar(frd.index.values, frd.values, width=3.0, color='cyan', align='center')

If you want your data presented as a table then you can:

dfcontr = frd.to_frame()             # Convert the 'frd' Series to a DataFrame with Series.to_frame()
dfcontr.index.name = 'Values'        # Set the 'dfcontr' DataFrame index name
dfcontr.columns = ['Frequency']      # Set the 'dfcontr' column name

from IPython.display import HTML     # Import the HTML display class
HTML(dfcontr.to_html())              # Call to_html() to render the DataFrame as an HTML table

2. Measures of central tendency¶

You can get the most common measures of central tendency of the data set (along with measures of dispersion and shape) by using:
- (a) the individual statistical methods of the pandas Series (numpy and scipy offer equivalent functions) and/or
- (b) the describe() method

Statistical functions: mean(), median(), std(), var(), min(), max(), skew(), kurt()

dc.mean(), dc.median(), dc.std(), dc.var(), dc.min(), dc.max(), dc.skew(), dc.kurt()

(67.23684210526316,
 65.0,
 11.13172775139386,
 123.91536273115221,
 45,
 90,
 0.065827061659537792,
 -0.082074115012775145)

# Use describe()
dc.describe()

count    38.000000
mean     67.236842
std      11.131728
min      45.000000
25%      60.000000
50%      65.000000
75%      75.000000
max      90.000000
Name: Control, dtype: float64
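The mode is listed above among the measures of central tendency but is not part of the describe() output. The sketch below rebuilds the 38 'Control' scores from the frequency table shown earlier (so the original row order is lost, which does not affect these statistics) and computes the mode with Series.mode(); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Unique 'Control' values and their frequencies, copied from the value_counts() output
values = [65, 70, 60, 75, 55, 50, 45, 90, 85, 80]
counts = [8, 7, 6, 5, 2, 2, 2, 2, 2, 2]

# Repeat each value as many times as it occurs to rebuild the 38 scores
control = pd.Series(np.repeat(values, counts), name="Control")

mean = control.mean()
median = control.median()
modes = control.mode()                       # a Series, because several values can tie
q1, q3 = control.quantile([0.25, 0.75])      # first and third quartiles

print(len(control), mean, median, modes.tolist(), q1, q3)
```

mode() returns a Series rather than a single number because a data set can have more than one most frequent value.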

Cumulative sum of the values in the 'dc' Series

dc.cumsum().tail()

33    2325
34    2375
35    2430
36    2490
37    2555
Name: Control, dtype: int64

Number of non-NaN values

dc.count()

38

Correlation: DataFrame.corr() returns the matrix of pairwise correlation coefficients between the DataFrame columns. You can choose how the coefficients are computed by setting the 'method' argument: method: {'pearson', 'kendall', 'spearman'} (default is 'pearson')

data.corr(method='spearman')

Covariance: DataFrame.cov() returns the matrix of pairwise covariances between the DataFrame columns. Covariance is a measure of how much two column variables vary together; the diagonal of the matrix holds the variance of each column.

data.cov()

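What is the difference between covariance and correlation? In short, the Pearson correlation is the covariance divided by the product of the two standard deviations, so it has no units and always lies between -1 and 1, whereas the size of the covariance depends on the scale of the variables. Read the explanation in this article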
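The relationship between the two measures can be checked numerically. This is a minimal sketch on hypothetical data (the columns 'x' and 'y' are made up for illustration): it divides the covariance by the product of the standard deviations and compares the result with the Pearson coefficient that pandas reports, then rescales one column to show which measure changes.

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements, for illustration only
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.0, 4.1, 5.9, 8.2, 9.8]})

cov_xy = df.cov().loc["x", "y"]                        # covariance of x and y
r_manual = cov_xy / (df["x"].std() * df["y"].std())    # covariance rescaled by both SDs
r_pandas = df.corr(method="pearson").loc["x", "y"]     # Pearson correlation from pandas

# Multiplying y by 10 changes the covariance but not the correlation
scaled = df.assign(y=df["y"] * 10)
cov_scaled = scaled.cov().loc["x", "y"]
r_scaled = scaled.corr().loc["x", "y"]

print(np.isclose(r_manual, r_pandas))
print(np.isclose(cov_scaled, 10 * cov_xy), np.isclose(r_scaled, r_pandas))
```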

3. Variance and Standard Deviation¶

Suppose we have a set of N data values (denoted as x_i) with mean M. Then the variance of the data is the average squared 'distance' of the x_i values from the mean M, thus providing a first measure of the data dispersion around the mean (see the figure below).

In practice, the standard deviation is the measure typically used in statistics to express the degree of dispersion of a set of data, because it is in the same units as the data themselves. It is defined as the square root of the variance:
```
      standard deviation = sqrt(variance) 
```
The important thing you need to know about standard deviation is that the population standard deviation ('σ') is calculated slightly differently from the sample standard deviation ('s'), as you can see in the figure below.
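Written out with the notation used above (N data values x_i with mean M), the two versions differ only in the denominator. In the first formula M is taken to be the mean of the whole population, in the second the mean of the sample; using the same symbol M for both is a simplification made here.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - M\right)^2}
\qquad\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - M\right)^2}
```

Dividing by N-1 instead of N makes the sample value slightly larger, which compensates for the fact that a sample tends to underestimate the spread of the population it was drawn from.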

If you want to go deeper into standard deviation you can read more by following the links below:
- Standard deviation @wikipedia
- How to Measure Variability in a Data Set @Stat Trek

(C) Calculating your own standard deviation¶

The standard deviation functions of the various statistical packages most often return the sample standard deviation ('s'), dividing by N-1.
However, the benefit of working with a real programming language is that we can build our own algorithm for computing standard deviation and divide by whatever we need. This is what we do in the following.

import numpy as np
import statistics as st 
import pandas as pd


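def mystd(dt, kind='sample'):       # kind: {sample, population, unbiased}
    """Return the standard deviation of dt (a numpy array or pandas Series)."""
    N = len(dt)                     # N: data size
    M = dt.mean()                   # M: mean
    SS = np.sum((dt - M) ** 2)      # SS: sum of squared deviations from the mean

    # variance: SS divided by a denominator that depends on 'kind'
    if kind == 'sample':
        var = SS / (N - 1)
    elif kind == 'population':
        var = SS / N
    elif kind == 'unbiased':
        var = SS / (N - 1.5)        # approximate correction for normally distributed data
    else:
        raise ValueError("kind should be one of {'population', 'sample', 'unbiased'}")

    return np.sqrt(var)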

ar = np.array([1,2,3,4,5,6,7,8,9,10])
print(ar.std())
print(st.stdev(ar))
print(mystd(ar, kind='sample'))

2.87228132327
3.0276503540974917
3.0276503541

As a practice¶

In the above code we print three different 'std' values but only two of them are the same. Why?
Check the documentation of numpy.std and statistics.stdev to understand more deeply how these functions work.
Furthermore: read the data from the 'researchdata.csv' file and compute the standard deviation of the Control group data set (use the pandas DataFrame std method). Which type of standard deviation does this method compute? How can you change this?

Copyright¶

- Free learning material
- See full copyright and disclaimer notice

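Frequency table of the 'Control' group (output of HTML(dfcontr.to_html())):

Values  Frequency
65      8
70      7
60      6
75      5
55      2
50      2
45      2
90      2
85      2
80      2

Spearman correlation matrix (output of data.corr(method='spearman')):

           Control   Treatment
Control    1.000000  0.991768
Treatment  0.991768  1.000000

Covariance matrix (output of data.cov()):

           Control     Treatment
Control    123.915363  118.627312
Treatment  118.627312  114.790896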