Home      |       Contents       |       About

Prev: Plotting a Series/DataFrame object       |      Next: -

GroupBy: 'group and conquer'

What is 'GroupBy'?

  • Suppose you have collected your data from your educational technology research project. Your subjects are school students who have been working following some innovative technology-supported didactic model. You have entered your data in a spreadsheet and now you are ready to proceed with processing. Naturally, in your student samples there are both boys and girls and you think that it would be nice if you could group your students based on gender and do some boys-girls group comparisons. Furthermore, you have data coming from classes of different age/level (say, K6 and K9). Wouldn't it be nice to further group your data based also on students' level? Here is where 'groupby' comes in.
  • 'groupby' is a pandas powerful method for grouping and dividing your original data into subgroups, based on one or more grouping factor(s) that you consider important (like gender and age in the above scenario). After grouping you can proceed and implement all available numpy/scipy/pandas-based statistics in the subgroups thus increasing the power of your analysis. Of course, you need to additionally enter in your data spreadsheet the information necessary for grouping (the grouping factor(s)).

Exploring 'groupby'

  • We are going now to input some sample data to explore the potential offered by 'groupby'
  • Data in the spreadsheet are organized as you see in the figure below. Note the column titles: 'Gender', 'Level' and "Performance'.
  • As a first step read the sheet data from the file into a pandas DataFrame
In [1]:
import numpy as np 
import pandas as pd 

data = pd.read_excel('../../data/researchdata.xlsx', sheetname="groupby")
data.head(5)
Out[1]:
Gender Level Performance
0 b K6 60
1 b K6 65
2 g K6 70
3 g K6 75
4 g K6 70
  • Now group data with 'Gender' as grouping factor. In this first example we apply a one-factor grouping (group by 'Gender'). We will later explore grouping with two factors.
In [2]:
grby_obj = data.groupby('Gender')
grby_obj
Out[2]:
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000004282DA0>
  • You see that grby_obj has been constructed as a 'DataFrameGroupBy' object and printing this type of object does not displsy anything unless some information about the constructed object.
  • We can, however, taste a bit of the grby_obj object by exploring available methods and attribute through dir()
In [3]:
# Change the slicing below to see more of the object attributes
print(dir(grby_obj)[:20])
['Gender', 'Level', 'Performance', '__bytes__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__']

Some first useful methods of 'groupby' object

  • len() returns the number of groups which are formed
In [4]:
len(grby_obj)
Out[4]:
2
  • Same with .ngroups
In [5]:
grby_obj.ngroups
Out[5]:
2
  • .groups will return a dictionary with name of groups as keys and an array of indexes identifying the data items included in the group
In [6]:
print(grby_obj.groups.keys(),'\n')
print(grby_obj.groups.values(),'\n')

print(grby_obj.groups['b'])
dict_keys(['g', 'b']) 

dict_values([[2, 3, 4, 9, 10, 11, 12, 18, 19, 20, 22, 23, 24, 28, 29, 30, 31, 32, 38], [0, 1, 5, 6, 7, 8, 13, 14, 15, 16, 17, 21, 25, 26, 27, 33, 34, 35, 36, 37, 39]]) 

[0, 1, 5, 6, 7, 8, 13, 14, 15, 16, 17, 21, 25, 26, 27, 33, 34, 35, 36, 37, 39]
  • See group statistics by calling describe()
In [7]:
grby_obj.describe()
Out[7]:
Performance
Gender
b count 21.000000
mean 59.761905
std 7.154752
min 45.000000
25% 55.000000
50% 60.000000
75% 65.000000
max 70.000000
g count 19.000000
mean 75.789474
std 7.685332
min 60.000000
25% 70.000000
50% 75.000000
75% 80.000000
max 90.000000
  • size() returns the size of the groups
In [8]:
grby_obj.size()
Out[8]:
Gender
b    21
g    19
dtype: int64
  • .count() provides similar info as size() displaying also all columns
In [9]:
grby_obj.count()
Out[9]:
Level Performance
Gender
b 21 21
g 19 19
  • Using nth() you can retrieve any specific item in the group by specifying its order within the group
In [10]:
grby_obj.nth(2)
Out[10]:
Gender Level Performance
Gender
b b K6 65
g g K6 70
  • Iterating over the groupby object returns a tuple containing the name and the respective group (a DataFrame in itself)
In [11]:
for gp_name, gp in grby_obj:
    print(gp_name)
    print(gp.head(3),'\n',type(gp),'\n')
b
  Gender Level  Performance
0      b    K6           60
1      b    K6           65
5      b    K6           65 
 <class 'pandas.core.frame.DataFrame'> 

g
  Gender Level  Performance
2      g    K6           70
3      g    K6           75
4      g    K6           70 
 <class 'pandas.core.frame.DataFrame'> 

Retrieving groups

  • You can retrieve any of the constructed groups with get_group() passing as argument the value of the grouping factor.
  • 'Retrieving' here means that get_group() will return a new DataFrame object. Therefore, you can apply on it any operations valid for DataFrames.
In [12]:
bdf = grby_obj.get_group('b')
print(bdf.head(),'\n')
print(bdf.Performance.head(),'\n')
print(bdf.Performance.mean())
  Gender Level  Performance
0      b    K6           60
1      b    K6           65
5      b    K6           65
6      b    K6           60
7      b    K6           65 

0    60
1    65
5    65
6    60
7    65
Name: Performance, dtype: int64 

59.76190476190476

Aggregate: applying functions to grouped data

  • Grouped objects (that is, 'DataFrameGroupBy' objects) offer the possibility for applying functions on data columns.
  • This can be done by calling the agg() function for the grouped object and passing as argument the function we would like to apply.
In [13]:
import numpy as np 
print(grby_obj.agg(np.mean))
        Performance
Gender             
b         59.761905
g         75.789474
  • More than one functions can be applied on a column, passing them as a list argument in agg()
In [14]:
print(grby_obj.agg([np.mean, np.std]))
       Performance          
              mean       std
Gender                      
b        59.761905  7.154752
g        75.789474  7.685332
  • We can pass as argument in agg() our own functions as the example below demonstrates.
  • If we need to apply the function on one specific column data then the argument should be a dictionary having as key the name of the column where the function is to be applied
In [15]:
import numpy as np 

# ssd() computes the sum of squared deviate from the group mean 
def ssd(x):
    mn = x.mean()
    sm = pow(x-mn,2).sum()
    return sm

# We apply ssd() only on nnumerical data of the 'Performance' column
print(grby_obj.agg({'Performance':ssd}))
        Performance
Gender             
b       1023.809524
g       1063.157895

GroupBy with 2 factors

  • You can easily groupby 2 factors by passing the factor names (column names) as a list argument in 'groupby'
  • See below the form that groups take now.
In [16]:
grby_obj2 = data.groupby(['Gender','Level'])
grby_obj2.describe()
Out[16]:
Performance
Gender Level
b K6 count 11.000000
mean 60.000000
std 7.416198
min 45.000000
25% 57.500000
50% 60.000000
75% 65.000000
max 70.000000
K9 count 10.000000
mean 59.500000
std 7.245688
min 45.000000
25% 56.250000
50% 62.500000
75% 65.000000
max 65.000000
g K6 count 7.000000
mean 77.857143
std 7.559289
min 70.000000
25% 72.500000
50% 75.000000
75% 82.500000
max 90.000000
K9 count 12.000000
mean 74.583333
std 7.821396
min 60.000000
25% 70.000000
50% 75.000000
75% 76.250000
max 90.000000
  • To get any specific group you must supply now a tuple with the values of the two grouping factors
In [17]:
grby_obj2.get_group(('b','K6'))
Out[17]:
Performance
0 60
1 65
5 65
6 60
7 65
8 70
13 45
14 50
15 55
16 60
17 65

. Free learning material
. See full copyright and disclaimer notice