
Quick Contact
Data Science Tutorial
- What is Data Science?
- Life Cycle of Data Analytics
- What is Machine Learning?
- Python Tools in Data Science
- Files
- Working with DataBase
- Data Science using R
- Hierarchical Indexing
- Data Science Using Scikit
- Clustering in Data Science
- Working with Network Data
- What is Plotting
- String Manipulation
- What is Text Analysis?
Hierarchical Indexing
Hierarchical indexing is an essential feature of pandas that enables you to have mul- tiple (two or more) index levels on an axis. Somewhat abstractly, it supports a way for you to work with larger dimensional record in a lower dimensional form. Let’s begin with a simple example; generate a Series with a list of lists (or arrays) as the index:
In [9]: data = pd.Series(np.random.randn(9),
...: index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
...: [1, 2, 3, 1, 3, 1, 2, 2, 3]])
In [10]: data
Out[10]:
a 1 -0.204708
2 0.478943
3 -0.519439
b 1 -0.555730
3 1.965781
c 1 1.393406
2 0.092908
d 2 0.281746
3 0.769023
dtype: float64
Hierarchical indexing plays an essential role in reshaping information and group-based operations like structuring a pivot table. For example, we can rearrange the data into a DataFrame using its unstack method:
In [16]: data.unstack()
Out[16]:
1 2 3
a -0.204708 0.478943 -0.519439
b -0.555730 NaN 1.965781
c 1.393406 0.092908 NaN
d NaN 0.281746 0.769023
The inverse operation of unstack is stack:
In [17]: data.unstack().stack()
Out[17]:
a 1 -0.204708
2 0.478943
3 -0.519439
b 1 -0.555730
3 1.965781
c 1 1.393406
2 0.092908
d 2 0.281746
3 0.769023
dtype: float64
stack and unstack will be explored in more detail later in this chapter.
With a DataFrame, either axis can have a hierarchical index:
In [18]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
....: index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
....: columns=[['Ohio', 'Ohio', 'Colorado'],
....: ['Green', 'Red', 'Green']])
In [19]: frame
Out[19]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
The hierarchical levels can have names (as strings or any Python objects). If so, these
will show up in the console output:
With partial column indexing you can similarly select groups of columns:
In [23]: frame['Ohio']
Out[23]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this: MultiIndex.from_arrays([[‘Ohio’, ‘Ohio’, ‘Colorado’],
['Green', 'Red', 'Green']],
names=['state', 'color'])
Combining and Merging Datasets
Data contained in pandas objects can be combined together in a number of ways:
pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
pandas.concat concatenates or “stacks” together objects along an axis.
The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.
Database-Style DataFrame Joins
Merge or join operations join datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for utilizing these algorithms on our data.
Example
In [35]: df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
....: 'data1': range(7)})
In [36]: df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
....: 'data2': range(3)})
In [37]: df1
Out[37]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
In [38]: df2
Out[38]:
data2 key
0 0 a
1 1 b
2 2 d
This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects we obtain:
In [39]: pd.merge(df1, df2)
Out[39]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
Suppose the column names are different in each object, we can determine them separately:
In [41]: df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
....: 'data1': range(7)})
In [42]: df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
....: 'data2': range(3)})
In [43]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[43]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
Apply now for Advanced Data Science Course