
- Suppose you have collected data from an educational technology research project. Your subjects are school students who have been working with an innovative technology-supported didactic model. You have entered the data in a spreadsheet and are ready to process it. Your student sample includes both boys and girls, so you would like to group the students by gender and compare the two groups. You also have data from classes of different age/level (say, K6 and K9), so it would be useful to group by level as well. This is where '**groupby**' comes in.
- '**groupby**' is a **powerful pandas method for grouping and dividing your original data into subgroups, based on one or more grouping factor(s) that you consider important** (like gender and level in the above scenario). After grouping, you can apply any of the available numpy/scipy/pandas statistics to each subgroup separately, which allows a more detailed analysis. Of course, the spreadsheet must also contain the information needed for grouping (the grouping factor(s)).
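
- As a minimal sketch of the idea, the snippet below splits a small table by 'Gender' and computes the mean 'Performance' of each subgroup. The `toy` DataFrame, its column names and its values are invented for illustration and are not the tutorial's spreadsheet data.

```python
import pandas as pd

# Hypothetical toy data (assumption): columns named after the scenario above
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Level": ["K6", "K9", "K6", "K6", "K9", "K9"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})

# Split the rows by 'Gender', then compute the mean 'Performance' per subgroup
mean_by_gender = toy.groupby("Gender")["Performance"].mean()
print(mean_by_gender)
```

- The result is a Series with one entry per group, indexed by the values of the grouping factor.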

- We will now load some sample data to explore what 'groupby' offers.
- Data in the spreadsheet are organized as you see in the figure below. Note the column titles: **'Gender', 'Level' and 'Performance'**.

- As a first step, read the sheet data from the file into a pandas DataFrame.

In [1]:

```
import numpy as np
import pandas as pd
# Recent pandas versions spell this keyword 'sheet_name' (the older 'sheetname' was removed)
data = pd.read_excel('../../data/researchdata.xlsx', sheet_name="groupby")
data.head(5)
```

Out[1]:

- Now group the data with 'Gender' as the grouping factor. In this first example we apply a **one-factor grouping** (group by 'Gender'). We will explore grouping with two factors later.

In [2]:

```
grby_obj = data.groupby('Gender')
grby_obj
```

Out[2]:

- You see that **grby_obj** has been constructed as a '**DataFrameGroupBy**' object. Printing an object of this type displays nothing except some information about the object itself.
- We can, however, get a taste of the **grby_obj** object by exploring its available methods and attributes through dir().

In [3]:

```
# Change the slicing below to see more of the object attributes
print(dir(grby_obj)[:20])
```
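
- Many of the names returned by dir() are internal (they start with an underscore). A possible way to list only the public ones is sketched below, using a small invented DataFrame rather than the tutorial's data.

```python
import pandas as pd

# Hypothetical toy data (assumption), just enough to build a groupby object
toy = pd.DataFrame({
    "Gender": ["b", "g", "b"],
    "Performance": [10.0, 11.0, 12.0],
})
grouped = toy.groupby("Gender")

# Keep only public names, i.e. those that do not start with an underscore
public_names = [name for name in dir(grouped) if not name.startswith("_")]
print(len(public_names))
print([name for name in public_names if name.startswith("n")])
```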

- **len()** returns the number of groups that were formed.

In [4]:

```
len(grby_obj)
```

Out[4]:

- The **.ngroups** attribute gives the same number.

In [5]:

```
grby_obj.ngroups
```

Out[5]:

- **.groups** returns a **dictionary** with the group names as keys and, as values, the index labels of the rows that belong to each group.

In [6]:

```
print(grby_obj.groups.keys(),'\n')
print(grby_obj.groups.values(),'\n')
print(grby_obj.groups['b'])
```
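
- The sketch below, on invented toy data, shows that len() and .ngroups agree and that the index labels stored in .groups can be passed to .loc to pull out the rows of a group.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})
grouped = toy.groupby("Gender")

# Two equivalent ways to get the number of groups
n_groups = len(grouped)
same_count = n_groups == grouped.ngroups

# .groups maps each group name to the index labels of its rows
b_labels = list(grouped.groups["b"])

# The labels can be used with .loc to select the rows of that group
b_rows = toy.loc[grouped.groups["b"]]
print(n_groups, same_count, b_labels)
print(b_rows)
```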

- See descriptive statistics for each group by calling **describe()**.

In [7]:

```
grby_obj.describe()
```

Out[7]:
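
- The table returned by describe() is itself a DataFrame, so single statistics can be read from it. The sketch below assumes an invented toy table and selects the 'Performance' column first, which gives one row per group and one column per statistic.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})

# One row per group; columns are count, mean, std, min, quartiles and max
summary = toy.groupby("Gender")["Performance"].describe()

# Read individual statistics by group name and statistic name
b_mean = summary.loc["b", "mean"]
g_max = summary.loc["g", "max"]
print(summary)
print(b_mean, g_max)
```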

- **size()** returns the size (number of rows) of each group.

In [8]:

```
grby_obj.size()
```

Out[8]:

- **.count()** provides similar information to size(), but reports it for every column and counts only the non-missing values.

In [9]:

```
grby_obj.count()
```

Out[9]:
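
- The difference between size() and count() only shows up when values are missing. The sketch below uses an invented toy table in which one 'Performance' value is deliberately set to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption) with one missing 'Performance' value
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g"],
    "Performance": [10.0, np.nan, 12.0, 11.0, 17.0],
})
grouped = toy.groupby("Gender")

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-missing values per column and group
print(sizes)
print(counts)
```

- Here group 'b' has three rows but only two recorded 'Performance' values, so the two methods disagree for that group.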

- Using **nth()** you can retrieve a specific row from each group by specifying its position within the group. Positions start at 0, so nth(2) below returns the third row of each group.

In [10]:

```
grby_obj.nth(2)
```

Out[10]:
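
- A small sketch of how nth() behaves when a group is too short, using invented toy data in which group 'g' has only two rows. Note that the index layout of the result depends on the pandas version, so the example only inspects the 'Performance' values.

```python
import pandas as pd

# Hypothetical toy data (assumption): three 'b' rows but only two 'g' rows
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0],
})
grouped = toy.groupby("Gender")

# Row at position 0 of every group
first_rows = grouped.nth(0)

# Row at position 2: group 'g' has no such row, so it contributes nothing
third_rows = grouped.nth(2)

print(first_rows["Performance"].tolist())
print(third_rows["Performance"].tolist())
```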

- **Iterating** over the groupby object yields a **tuple** containing **the name and the respective group** (a DataFrame in itself).

In [11]:

```
# Each iteration yields the group name and the group's rows as a DataFrame
for gp_name, gp in grby_obj:
    print(gp_name)
    print(gp.head(3),'\n',type(gp),'\n')
```
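
- Iteration is handy when you want to collect a custom result per group. The sketch below, on invented toy data, stores the number of students and the best 'Performance' of each group in an ordinary dictionary.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})

# Collect (group size, best performance) for every group name
per_group = {}
for name, group in toy.groupby("Gender"):
    per_group[name] = (len(group), group["Performance"].max())

print(per_group)
```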

- You can retrieve any of the constructed groups with **get_group()**, passing as argument the value of the grouping factor.
- 'Retrieving' here means that **get_group()** returns a **new DataFrame object**. Therefore, you can apply to it any operation that is valid for DataFrames.

In [12]:

```
bdf = grby_obj.get_group('b')
print(bdf.head(),'\n')
print(bdf.Performance.head(),'\n')
print(bdf.Performance.mean())
```
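
- For a single group, get_group() selects the same rows as an ordinary boolean filter on the grouping column. The sketch below compares the two approaches on invented toy data.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "g", "b", "g", "b"],
    "Performance": [10.0, 11.0, 14.0, 17.0, 12.0],
})

# Rows of group 'b' obtained from the groupby object
via_group = toy.groupby("Gender").get_group("b")

# The same rows obtained with a boolean mask
via_mask = toy[toy["Gender"] == "b"]

same_values = via_group["Performance"].equals(via_mask["Performance"])
print(via_group)
print(same_values)
```

- get_group() is convenient once the groupby object already exists; the boolean filter avoids building one when only a single subgroup is needed.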

- Grouped objects (that is, 'DataFrameGroupBy' objects) make it possible to apply functions to the data columns of each group.
- This is done by calling the agg() method of the grouped object and passing as argument the function (or its name) we would like to apply.

In [13]:

```
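# 'Level' holds text labels, so restrict the aggregation to the numeric column
# The string 'mean' selects the built-in group mean (np.mean is also accepted)
print(grby_obj[['Performance']].agg('mean'))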
```

- More than one function can be applied to a column by passing them as a list to agg().

In [14]:

```
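# Aggregate only the numeric column; 'mean' and 'std' name the built-in group statistics
print(grby_obj[['Performance']].agg(['mean', 'std']))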
```
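
- When a list of functions is applied to a single selected column, the result is a table with one row per group and one column per function. The sketch below illustrates this on invented toy data; 'std' here is the sample standard deviation that pandas computes by default.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})

# Selecting the column first gives flat column names: 'mean' and 'std'
stats = toy.groupby("Gender")["Performance"].agg(["mean", "std"])

# Individual cells are addressed by group name and function name
b_std = stats.loc["b", "std"]
g_mean = stats.loc["g", "mean"]
print(stats)
```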

- We can also pass our own functions to agg(), as the example below demonstrates.
- If the function should be applied to one specific column only, the argument should be a dictionary whose key is the name of that column and whose value is the function.

In [15]:

```
# ssd() computes the sum of squared deviations from the group mean
def ssd(x):
    mn = x.mean()
    sm = pow(x-mn,2).sum()
    return sm
# We apply ssd() only to the numerical data of the 'Performance' column
print(grby_obj.agg({'Performance':ssd}))
```
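
- A custom aggregation can be sanity-checked against a built-in one. The sum of squared deviations equals (n - 1) times the sample variance, so the sketch below compares the two on invented toy data. The function is a restatement of ssd() above.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})

def ssd(x):
    """Sum of squared deviations from the mean of x."""
    return ((x - x.mean()) ** 2).sum()

grouped = toy.groupby("Gender")["Performance"]

# Custom aggregation, one value per group
ssd_by_gender = grouped.agg(ssd)

# Cross-check: (n - 1) * sample variance gives the same quantity
check = (grouped.count() - 1) * grouped.var()

print(ssd_by_gender)
print(check)
```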

- You can easily group by two factors by passing the factor names (column names) as a list to 'groupby'.
- See below the form that the groups now take.

In [16]:

```
grby_obj2 = data.groupby(['Gender','Level'])
grby_obj2.describe()
```

Out[16]:
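
- With two grouping factors, the result is indexed by pairs of values. The sketch below, on invented toy data, computes the mean per (Gender, Level) pair and then uses unstack() to lay the levels out as columns, which can make boys-girls comparisons easier to read.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Level": ["K6", "K9", "K6", "K6", "K9", "K9"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})

# Mean 'Performance' for every (Gender, Level) combination
means = toy.groupby(["Gender", "Level"])["Performance"].mean()

# Move 'Level' from the row index to the columns: rows are genders
table = means.unstack("Level")

print(means)
print(table)
```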

- To get a specific group, you must now supply a tuple with the values of the two grouping factors.

In [17]:

```
grby_obj2.get_group(('b','K6'))
```

Out[17]:
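
- If you are unsure which tuples are valid, the keys of .groups list them. The sketch below, on invented toy data, prints the available keys and then retrieves one of the groups.

```python
import pandas as pd

# Hypothetical toy data (assumption), not the tutorial's spreadsheet
toy = pd.DataFrame({
    "Gender": ["b", "b", "b", "g", "g", "g"],
    "Level": ["K6", "K9", "K6", "K6", "K9", "K9"],
    "Performance": [10.0, 14.0, 12.0, 11.0, 17.0, 14.0],
})
grouped2 = toy.groupby(["Gender", "Level"])

# Each key is a (Gender, Level) tuple that get_group() accepts
keys = sorted(grouped2.groups.keys())
print(keys)

# Retrieve the girls of level K9
g_k9 = grouped2.get_group(("g", "K9"))
print(g_k9)
```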
