Indexes and NaN values

(1) Handling indexes

Setting the index

  • Index labels need not be integers. You can set them by passing a list of labels through the index argument when constructing the Series object.
In [1]:
import numpy as np
import pandas as pd

s = pd.Series([10,20,30,40,50], index = ['a','b','c','d','e'])
print(s)
print(s['c'])

x = s['c']+s['a']
print(x)
y = s[['b','e']]
print(y)
a    10
b    20
c    30
d    40
e    50
dtype: int64
30
40
b    20
e    50
dtype: int64
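
Once the labels are no longer integers, it helps to keep label-based and position-based access apart. The sketch below rebuilds the same Series and uses the standard .loc and .iloc accessors; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['c']          # label-based lookup
by_position = s.iloc[2]        # position-based lookup (third element)

label_slice = s.loc['b':'d']   # label slices include both endpoints
pos_slice = s.iloc[1:3]        # positional slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Note that the label slice 'b':'d' returns three elements while the positional slice 1:3 returns two.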

Changing index labels when using a dict

  • When constructing a Series from a dict you can discard the dictionary keys and assign a new list of index labels to the Series object. To do so, pass only the dict values as data, as in the example below.
In [2]:
import numpy as np
import pandas as pd

data = {'a':1,
        'b':2,
        'c':3,
        'd':4,
        'e':5}
new_ind = ['A','B','C','D','E']
# Pass only the values (in dict order) so that the new labels replace the keys
s = pd.Series(list(data.values()), index=new_ind)
print(s)
A    1
B    2
C    3
D    4
E    5
dtype: int64
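
The reason for passing only the values is that a dict passed together with index= is matched by key, not relabeled. A minimal sketch of the difference, with made-up labels such as 'z' used purely for illustration:

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

# dict + index=: the index selects entries by key; unknown keys become NaN
subset = pd.Series(data, index=['a', 'c', 'z'])
print(subset)

# values + index=: the labels are simply attached position by position
relabeled = pd.Series(list(data.values()), index=['A', 'B', 'C', 'D', 'E'])
print(relabeled)
```

Here 'z' is not a key of the dict, so it appears with NaN in the first Series.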

rename()

  • If you need to change specific labels after the Series has been constructed, you can use the rename() method, as in the examples below.
In [3]:
import numpy as np
import pandas as pd

data = {'a':1,
        'b':2,
        'c':3,
        'd':4,
        'e':5}

s = pd.Series(data)
print(s)

# Use rename() to change some of the current index labels
# Note that old labels are mapped onto new labels and a NEW Series is returned
s2 = s.rename({'a':'A', 'd':'D'})
print(s2)
a    1
b    2
c    3
d    4
e    5
dtype: int64
A    1
b    2
c    3
D    4
e    5
dtype: int64
  • If you need to change all index labels, pass rename() a mapping (dictionary) from every old label to its new label. Such a mapping can be built by zipping the lists of old and new labels.
In [4]:
import numpy as np
import pandas as pd

data = {'a':1,
        'b':2,
        'c':3,
        'd':4,
        'e':5}

s = pd.Series(data)
print(s)

old_ind = s.index                                # Index object holding the labels of s
new_ind = [label.upper() for label in old_ind]   # the same labels in upper case

# rename() accepts a dict that maps old labels to new labels
# inplace=True modifies s itself instead of returning a new Series

s.rename(dict(zip(old_ind, new_ind)), inplace=True)
s
a    1
b    2
c    3
d    4
e    5
dtype: int64
Out[4]:
A    1
B    2
C    3
D    4
E    5
dtype: int64
  • Read more about the Series rename() method
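
When every label changes according to one rule, building the dict by hand is not strictly necessary. The sketch below shows two alternatives under the same assumptions as the example above (string labels, upper-casing as the rule): rename() also accepts a function, and the index attribute can be assigned directly.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

# rename() applies the function to each label and returns a new Series
upper = s.rename(str.upper)
print(upper)

# Assigning to .index replaces all labels on the object itself
s_copy = s.copy()
s_copy.index = [label.upper() for label in s_copy.index]
print(s_copy)
```

The original s keeps its lower-case labels in both cases, because rename() returns a new object and the assignment is made on a copy.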

reindex

  • reindex() works for both Series and DataFrame objects. It returns a new object conformed to the given index labels: values of existing labels are kept, while labels that are not present in the original get NaN, unless you supply a different default through the fill_value argument.
In [5]:
import numpy as np
import pandas as pd

data = {'a':1,
        'b':2,
        'c':3,
        'd':4,
        'e':5}

s = pd.Series(data)
print(s)

old_ind = s.index
added_ind = [label.upper() for label in old_ind]   # upper-case versions of the labels
new_ind = old_ind.union(added_ind)                 # sorted union of the old and the added labels

snew = s.reindex(new_ind, fill_value='Unknown')    # 'Unknown' fills the labels that s does not have
snew
a    1
b    2
c    3
d    4
e    5
dtype: int64
Out[5]:
A    Unknown
B    Unknown
C    Unknown
D    Unknown
E    Unknown
a          1
b          2
c          3
d          4
e          5
dtype: object
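
Notice that the string fill value turned the result into dtype object. A short sketch of how the choice of fill_value affects the result, using a smaller Series and a hypothetical extra label 'x':

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3})
labels = ['a', 'b', 'c', 'x']

# Default behavior: the new label gets NaN, so the values become floats
with_nan = s.reindex(labels)
print(with_nan)

# A numeric fill value keeps the Series numeric (integers stay integers)
with_zero = s.reindex(labels, fill_value=0)
print(with_zero)
```

In both cases s itself is left unchanged, since reindex() returns a new Series.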

Automatic alignment

  • Index labels allow pandas to automatically align data when performing operations across Series objects that share index labels.
  • Automatic data alignment is an important feature that Series offers over plain numpy arrays, which combine values strictly by position.

In the following example, s1 and s2 hold the same labels in a different order, and the addition pairs the values by label rather than by position.

In [6]:
import numpy as np
import pandas as pd

s1 = pd.Series([10,20,30,40,50], index = ['a','b','c','d','e'], dtype='int')
s2 = pd.Series([5,10,15,20,25], index = ['e','d','c','b','a'], dtype='int')
print(s1)
print(s2)
s = s1+s2
s
a    10
b    20
c    30
d    40
e    50
dtype: int32
e     5
d    10
c    15
b    20
a    25
dtype: int32
Out[6]:
a    35
b    40
c    45
d    50
e    55
dtype: int32
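
To make the contrast with numpy concrete, the sketch below adds the same two Series once as Series (aligned by label) and once as the underlying arrays (paired by position). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([5, 10, 15, 20, 25], index=['e', 'd', 'c', 'b', 'a'])

by_label = s1 + s2                          # values paired by index label
by_position = s1.to_numpy() + s2.to_numpy() # values paired by position

print(by_label)
print(by_position)
```

With alignment, 'a' gives 10 + 25 = 35; without it, the first elements 10 and 5 are added, giving 15.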

(2) Handling the 'NaN' (missing) values

  • Data often have missing values, which are represented as NaN in a Series object.
  • The isnull() and notnull() methods identify NaN values: each returns a boolean Series with the same index as the original.
In [7]:
import numpy as np
import pandas as pd

# Missing values are declared as 'None' in code
data = {'a':1,
        'b':None,
        'c':3,
        'd':None,
        'e':5}

s = pd.Series(data)
print(s,'\n')

print(s.isnull(),'\n')
print(s.notnull())

# Alternatively you may write: 
# pd.isnull(s)
# pd.notnull(s)
a     1
b   NaN
c     3
d   NaN
e     5
dtype: float64 

a    False
b     True
c    False
d     True
e    False
dtype: bool 

a     True
b    False
c     True
d    False
e     True
dtype: bool
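
Because the result of isnull() and notnull() is a boolean Series, it can be summed to count missing values or used as a mask to filter them out. A minimal sketch on the same data:

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': None, 'c': 3, 'd': None, 'e': 5})

# True counts as 1, so the sum is the number of missing values
n_missing = int(s.isnull().sum())
print(n_missing)

# Boolean masking keeps only the entries that have a value
present = s[s.notnull()]
print(present)
```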

Adding Series objects with data alignment

  • When an index label is missing from at least one of the Series objects taking part in an operation, it appears with a NaN value in the result. This behavior can, however, be overridden, as we shall see in the DataFrame object.
In [8]:
import numpy as np
import pandas as pd

# Data: Unemployment percentages in various countries
un_data = {'Greece':27,
           'Spain':21,
           'Italy':20}
uns = pd.Series(un_data)     # uns is a Series object based on un_data dictionary
print(uns, '\n')

uns.name='Unemployment'      # Setting the Series 'name' property
uns.index.name='Countries'   # Setting the Series 'index.name' property

inc_data = {'Spain':2.,     
            'Italy':-1.5,
            'Greece':1.,
            'Portugal':3.}   # The label 'Portugal' is not included in the uns Series 
incs = pd.Series(inc_data)   # A new Series object (incs) is constructed here

sums = uns + incs            # Adding the values of the two Series is no problem..  
                             #..as data are automatically aligned 
print(sums)
Greece    27
Italy     20
Spain     21
dtype: int64 

Greece      28.0
Italy       18.5
Portugal     NaN
Spain       23.0
dtype: float64
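
For Series objects, one way to avoid the NaN is the add() method with its fill_value argument, which substitutes a value for a label that is missing on one side. The sketch below reuses the data above; treating a missing unemployment figure as 0 is an assumption made only to illustrate the mechanism.

```python
import pandas as pd

uns = pd.Series({'Greece': 27, 'Spain': 21, 'Italy': 20})
incs = pd.Series({'Spain': 2., 'Italy': -1.5, 'Greece': 1., 'Portugal': 3.})

# A label missing from one Series is treated as 0 before adding
sums = uns.add(incs, fill_value=0)
print(sums)
```

With this call 'Portugal' appears as 3.0 instead of NaN, while the other countries give the same sums as the + operator.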
