Quick Contact


    Hierarchical Indexing

    Hierarchical indexing is an essential feature of pandas that enables you to have mul‐ tiple (two or more) index levels on an axis. Somewhat abstractly, it supports a way for you to work with larger dimensional record in a lower dimensional form. Let’s begin with a simple example; generate a Series with a list of lists (or arrays) as the index:

    In [9]: data = pd.Series(np.random.randn(9),

    ...: index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],

    ...: [1, 2, 3, 1, 3, 1, 2, 2, 3]])

    In [10]: data

    Out[10]:

    a 1 -0.204708

    2 0.478943

    3 -0.519439

    b 1 -0.555730

    3 1.965781

    c 1 1.393406

    2 0.092908

    d 2 0.281746

    3 0.769023

    dtype: float64

    Hierarchical indexing plays an essential role in reshaping information and group-based operations like structuring a pivot table. For example, we can rearrange the data into a DataFrame using its unstack method:

    In [16]: data.unstack()

    Out[16]:

    1 2 3

    a -0.204708 0.478943 -0.519439

    b -0.555730 NaN 1.965781

    c 1.393406 0.092908 NaN

    d NaN 0.281746 0.769023

    The inverse operation of unstack is stack:

    In [17]: data.unstack().stack()

    Out[17]:

    a 1 -0.204708

    2 0.478943

    3 -0.519439

    b 1 -0.555730

    3 1.965781

    c 1 1.393406

    2 0.092908

    d 2 0.281746

    3 0.769023

    dtype: float64

    stack and unstack will be explored in more detail later in this chapter.

    With a DataFrame, either axis can have a hierarchical index:

    In [18]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),

    ....: index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],

    ....: columns=[['Ohio', 'Ohio', 'Colorado'],

    ....: ['Green', 'Red', 'Green']])

    In [19]: frame

    Out[19]:

    Ohio Colorado

    Green Red Green

    a 1 0 1 2

    2 3 4 5

    b 1 6 7 8

    2 9 10 11

    The hierarchical levels can have names (as strings or any Python objects). If so, these

    will show up in the console output:

    
    

    With partial column indexing you can similarly select groups of columns:

    In [23]: frame['Ohio']

    Out[23]:

    color Green Red

    key1 key2

    a 1 0 1

    2 3 4

    b 1 6 7

    2 9 10

    A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this: MultiIndex.from_arrays([[‘Ohio’, ‘Ohio’, ‘Colorado’],

    ['Green', 'Red', 'Green']],

    names=['state', 'color'])

    Combining and Merging Datasets

    Data contained in pandas objects can be combined together in a number of ways:

    pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.

    pandas.concat concatenates or “stacks” together objects along an axis.

    The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

    Database-Style DataFrame Joins

    Merge or join operations join datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for utilizing these algorithms on our data.

    Example

    In [35]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],

    ....: 'data1': range(7)})

    In [36]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],

    ....: 'data2': range(3)})

    In [37]: df1

    Out[37]:

    data1 key

    0 0 b

    1 1 b

    2 2 a

    3 3 c

    4 4 a

    5 5 a

    6 6 b

    In [38]: df2

    Out[38]:

    data2 key

    0 0 a

    1 1 b

    2 2 d

    This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects we obtain:

    In [39]: pd.merge(df1, df2)

    Out[39]:

    data1 key data2

    0 0 b 1

    1 1 b 1

    2 6 b 1

    3 2 a 0

    4 4 a 0

    5 5 a 0

    Suppose the column names are different in each object, we can determine them separately:

    In [41]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],

    ....: 'data1': range(7)})

    In [42]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],

    ....: 'data2': range(3)})

    In [43]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')

    Out[43]:

    data1 lkey data2 rkey

    0 0 b 1 b

    1 1 b 1 b

    2 6 b 1 b

    3 2 a 0 a

    4 4 a 0 a

    5 5 a 0 a

    Copyright 1999- Ducat Creative, All rights reserved.