
Pandas

Shaurya Singhal · 23 min read

Source: View original notebook on GitHub

Category: Machine Learning / Learn ML libraries


Pandas

  • You can think of pandas as an extremely powerful version of Excel, with many more features.
  • The name pandas comes from "panel data".

Topics discussed:

  • Introduction to Pandas and its Features
  • Series
  • DataFrames
    • Selecting rows and columns (using [] or the loc/iloc attributes)
    • Adding and removing rows (using [] to add and drop() to remove)
    • Conditional statements on DataFrames
    • Setting and resetting indexes (using set_index() and reset_index())
    • Missing data (using dropna() to drop and fillna() to fill missing values)
    • GroupBy (groupby())
    • Converting data to NumPy (using df.values)
    • Merging, joining, and concatenating (merge(), join(), concat())
    • Operations (unique(), nunique(), value_counts(), isnull(), sort_values(), apply(), columns, index, etc.)
  • Data Input and Output
  • Pandas Plotting

Features

  • pandas - a powerful data analysis and manipulation library for Python
  • pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.
  • It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Main Features

Here are just a few of the things that pandas does well:

  • Easy handling of missing data in floating point as well as non-floating point data.
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations.
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
  • Intuitive merging and joining data sets.
  • Flexible reshaping and pivoting of data sets.
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format.
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc. (a short sketch follows this list).
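To make the time-series bullet concrete, here is a minimal sketch (not from the original notebook) of date-range generation, a moving-window statistic, and frequency conversion; the dates and values are invented for illustration:

import pandas as pd
import numpy as np

dates = pd.date_range(start='2020-01-01', periods=30, freq='D')  # hypothetical month of daily dates
ts = pd.Series(np.random.randn(30), index=dates)                 # random daily values
ts.rolling(window=7).mean()                                      # 7-day moving-window average
ts.resample('W').mean()                                          # frequency conversion: daily -> weekly means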
import pandas as pd
import numpy as np

Series and DataFrame are the two main pandas data structures we will use.

1. pd.Series

  • A Series is essentially a hash table mapping keys to values, namely (index, data) pairs.
  • A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
  • It can hold any type of data, even built-in functions.
import numpy as np
import pandas as pd
data = np.array([1,2,3,4])
label = np.array([0,1,2,3])
label_char = np.array(['shaurya', 'priyansh', 'mamta', 'mukesh'])
hash_table = pd.Series(data) # By default indices will be 0,1,2,3 and so on...

# providing numbered-indexing as well
hash_table2 = pd.Series(data = data, index = label)


# providing custom-indexing as well
hash_table3 = pd.Series(data = data, index = label_char)


# providing dictionary d
d= {'shaurya':1, 'singhal':2, 'mukesh':3, 'mamta':4}
hash_table4 = pd.Series(d)

daata = list(zip(label_char, data))
hash_table5 = pd.Series(daata)
print(hash_table)
print('----------------')
print(hash_table2)
print('----------------')
print(hash_table3)
print('----------------')
print(hash_table4)
print('----------------')
print(hash_table5)
print('----------------')

Output:

0    1
1 2
2 3
3 4
dtype: int32
----------------
0 1
1 2
2 3
3 4
dtype: int32
----------------
shaurya 1
priyansh 2
mamta 3
mukesh 4
dtype: int32
----------------
shaurya 1
singhal 2
mukesh 3
mamta 4
dtype: int64
----------------
0 (shaurya, 1)
1 (priyansh, 2)
2 (mamta, 3)
3 (mukesh, 4)
dtype: object
----------------
# saving functions
hash_fun = pd.Series(data = [sum,min,max,print,len])
hash_fun

Output:

0      <built-in function sum>
1 <built-in function min>
2 <built-in function max>
3 <built-in function print>
4 <built-in function len>
dtype: object

Operations are then also done based on the index:

ser1 = pd.Series([1,2,3,4], index = ['USA', 'Germany','USSR', 'Japan'])
ser2 = pd.Series([1,2,5,4], index = ['USA', 'Germany','Italy', 'Japan'])
ser1 + ser2

Output:

Germany    4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
ser1

Output:

USA        1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2

Output:

USA        1
Germany 2
Italy 5
Japan 4
dtype: int64
# pd.Series? shows the docstring and constructor signature in a notebook
pd.Series?
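Values in a Series can also be looked up by label, much like a dictionary; a quick sketch using ser1 and ser2 defined above:

ser1['USA']            # label-based lookup -> 1
ser1[['USA','Japan']]  # a list of labels returns a sub-Series
ser2['Italy']          # -> 5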

2. DataFrames

  • DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index (i.e. several hash tables combined over the same keys). Let's use pandas to explore this topic!

  • We will see that a DataFrame behaves like a Series of Series and is, in fact, built on NumPy arrays.

  • The DataFrame constructor takes data, index, and columns (in that order).
import pandas as pd
from numpy.random import randn
df = pd.DataFrame(data = randn(4,4), index=['A','B','C','D'], columns=['E','F','G','H'])
df

Output:

E         F         G         H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
data  = np.arange(5)
daata = list(zip(['s','p','ma','mu'], data))
print(daata)
df2 = pd.DataFrame(daata,index = ['A','B','C','D'])

Output:

[('s', 0), ('p', 1), ('ma', 2), ('mu', 3)]
df2

Output:

0  1
A s 0
B p 1
C ma 2
D mu 3
df

Output:

E         F         G         H
A 0.072135 1.072180 -0.498480 1.228543
B -1.726358 -0.774394 -1.335940 0.078463
C 1.068696 0.198596 -0.986149 -0.622723
D -0.007756 -0.481936 -1.610555 0.667122

(2a) Selecting/Indexing rows (use the loc/iloc attributes) and columns (use []) of a DataFrame

# if we look and select E-th column use:
print(df['E']) # selecting column of DataFrame
print('------')
print(type(df['E']))
# it is a pandas - Series

Output:

A    0.072135
B -1.726358
C 1.068696
D -0.007756
Name: E, dtype: float64
------
<class 'pandas.core.series.Series'>
# selecting row
# df['A'] # this is an error

ref = df.loc
print(ref['A'])
print('---------')
type(ref['A'])
# this is also a pandas- Series

Output:

E    0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64
---------
pandas.core.series.Series
# above thing can be combined
print(df.loc['A'])

Output:

E    0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64
# can use iloc to select based on numerical index starting with 0 (no Surprise here :| )
print(df.iloc[0]) # same as df.loc['A']

Output:

E    0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64

Conclusion

  • DataFrames are basically a Series of Series.
  • Since Series are based on NumPy, DataFrames are as well. Let's call the shape attribute on one:
df.shape

Output:

(4, 4)
# we can select multiple columns/rows as well by passing a list of columns/rows, e.g.
df[['E','F']]

Output:

E         F
A 0.072135 1.072180
B -1.726358 -0.774394
C 1.068696 0.198596
D -0.007756 -0.481936
df[['E','F']].loc[['C','D']]

Output:

E         F
C 1.068696 0.198596
D -0.007756 -0.481936
# or use iloc
df[['E','F']].iloc[[2,3]]

Output:

E         F
C 1.068696 0.198596
D -0.007756 -0.481936
# grabbing individual element
df.loc['A','E']

Output:

0.07213467134028595
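loc also accepts label slices and lists for both axes at once, and iloc does the same with integer positions; a small sketch on the same df (note that label slices include the end label, while integer slices do not):

df.loc['A':'C', ['E','F']]  # rows A through C (inclusive), columns E and F
df.iloc[0:2, 0:2]           # first two rows and first two columns (end-exclusive)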

Converting a DataFrame into a NumPy array (use the values attribute on the DataFrame object)

# converting DataFrame into numpy arrays
df = pd.DataFrame(data = randn(4,4))
print(type(df))
print()
print(df)
print('======================')
print()
df = df.values
print(type(df))
print(df)
print('======================')

Output:

<class 'pandas.core.frame.DataFrame'>

0 1 2 3
0 0.208281 -0.738167 0.269264 0.028646
1 -1.445427 -0.562245 -2.903674 -0.087284
2 -1.328404 -0.255863 -0.023696 -0.346428
3 -0.737738 -0.947259 -1.132884 -0.592254
======================

<class 'numpy.ndarray'>
[[ 0.20828142 -0.73816711 0.26926379 0.0286463 ]
[-1.44542694 -0.56224521 -2.90367424 -0.08728398]
[-1.32840381 -0.25586251 -0.02369633 -0.34642762]
[-0.73773775 -0.94725893 -1.13288378 -0.59225421]]
======================

(2b) Adding and Removing columns/rows in DataFrames

# Adding
df

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df['new'] = df['E'] + df['F'] # remember the hash table (map in C++) analogy: this adds a new column to the table
df

Output:

E         F         G         H       new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
df.loc['new_row'] = df.loc['A'] + df.loc['B'] # adding row using loc/iloc attribute
df

Output:

E         F         G         H       new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
# removing -> not in place by default; set inplace=True to make it permanent
df.drop('A', axis=0) # set axis to 0 for rows
# df.drop('E',axis=1) # set axis to 1 for columns

Output:

E         F         G         H       new
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
df # see nothing is dropped ->??

Output:

E         F         G         H       new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
# to Drop set inplace =True
df.drop('new',axis=1, inplace=True)
df # now new is removed

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
new_row -1.634011 -1.508054 -0.995743 0.375528
# let's remove new_row as well
df.drop('new_row', inplace=True, axis=0)
df # back to Square One!!

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615

(2c) Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

df

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df>0

Output:

E      F      G      H
A True False False False
B False False True True
C False False True False
D True True False False
df[df>0]

Output:

E         F         G         H
A 0.107055 NaN NaN NaN
B NaN NaN 0.111722 0.444997
C NaN NaN 0.882894 NaN
D 0.991462 1.046418 NaN NaN
df[df<1]

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 NaN -1.461521 -0.365615
df

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df[df['E']>0] # selects all rows where the value in column E is > 0

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
D 0.991462 1.046418 -1.461521 -0.365615
# to select only the F and G columns where the E values are > 0
df[df['E']>0][['F','G']] # why does this work? df[df['E']>0] is itself a DataFrame, so we can chain everything learned so far

Output:

F         G
A -0.856872 -1.107465
D 1.046418 -1.461521

For two conditions you can use | and & with parentheses (don't use the Python keywords and/or, as they work on single boolean values, not on a Series/array of elements):

# to show my point
df['E']>0

Output:

A     True
B False
C False
D True
Name: E, dtype: bool
# to show my point
df['F']>1

Output:

A    False
B False
C False
D True
Name: F, dtype: bool
# df[(df['E']>0) and (df['F'] > 1)] # is an error -> The truth value of a Series is ambiguous.  
df[(df['E']>0) & (df['F'] > 1)] # only row D is returned

Output:

E         F         G         H
D 0.991462 1.046418 -1.461521 -0.365615
# similarly
# df[(df['E']>0) or (df['F'] > 1)] # error
df[(df['E']>0) | (df['F'] > 1)]

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
D 0.991462 1.046418 -1.461521 -0.365615
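Two related tricks not shown in the notebook: a condition can be negated with ~, and Series.between() checks membership in an interval; a quick sketch on the same df:

df[~(df['E']>0)]           # rows where E is NOT greater than 0
df[df['E'].between(-1, 1)] # rows where E lies in the closed interval [-1, 1]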

(2d) Index Setting (using set_index(value)) and Resetting (using reset_index())

df

Output:

E         F         G         H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
# Reset to default 0,1...n index
df.reset_index() # also use inplace=True to make effect permanent

Output:

index         E         F         G         H
0 A 0.211382 0.277118 -0.069357 1.751781
1 B -1.107231 0.458712 0.533907 1.065361
2 C 0.230110 2.127683 0.612683 -0.938353
3 D -1.991244 0.719935 1.462831 -1.112004
df # showing resetting was not inplace

Output:

E         F         G         H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
newind = 'CA NY WY OR'.split()
df['States'] = newind # adding new col
df

Output:

E         F         G         H States
A 0.211382 0.277118 -0.069357 1.751781 CA
B -1.107231 0.458712 0.533907 1.065361 NY
C 0.230110 2.127683 0.612683 -0.938353 WY
D -1.991244 0.719935 1.462831 -1.112004 OR
df.set_index('States') # setting a different index. Remember: it overwrites the original index! Also not in place by default.

Output:

E         F         G         H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004
df

Output:

E         F         G         H States
A 0.107055 -0.856872 -1.107465 -0.069469 CA
B -1.741066 -0.651183 0.111722 0.444997 NY
C -1.150753 -0.368783 0.882894 -1.017265 WY
D 0.991462 1.046418 -1.461521 -0.365615 OR
df.set_index('States',inplace=True)
df

Output:

E         F         G         H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004

(2e) Missing Data (use dropna() and fillna(); use inplace=True to make changes permanent)

import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan], 'B':[5,np.nan,np.nan],'C':[1,2,3]}) # passing a dict to DataFrame makes the keys its column
# labels and the values the cell data, with rows indexed 0,1,2,...
df

Output:

A    B  C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.dropna() # drops any row (by default axis=0) containing even a single NaN

Output:

A    B  C
0 1.0 5.0 1
df.dropna(axis=1)

Output:

C
0 1
1 2
2 3
df.dropna(thresh=2) # thresh sets the minimum number of non-NaN values required to keep a row (axis=0) or column (axis=1)

Output:

A    B  C
0 1.0 5.0 1
1 2.0 NaN 2
df.fillna(value='FILL VALUE')

Output:

A           B  C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
df # changes so far were not in place; use inplace=True to make them permanent

Output:

A    B  C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df['A']

Output:

0    1.0
1 2.0
2 NaN
Name: A, dtype: float64
df['A'].mean()

Output:

1.5
df['A'].fillna(value=df['A'].mean())

Output:

0    1.0
1 2.0
2 1.5
Name: A, dtype: float64
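Instead of filling column by column, you can pass a Series of per-column values to fillna(), or forward-fill down each column; a minimal sketch on the same df (note that newer pandas versions prefer df.ffill() over method='ffill'):

df.fillna(value=df.mean())   # fill each numeric column with its own mean
df.fillna(method='ffill')    # propagate the last valid value forward down each column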

(2f) Groupby

The groupby() method allows you to group rows of data together and call aggregate functions

[Figure: groupby follows the split-apply-combine pattern]

  • Use the aggregate() function to apply a function over the GroupBy object and collapse each group to a single value.
  • Use transform(func) to get a result with the same shape as the original frame.
  • Use filter(func) to keep only the groups for which func returns True (a short sketch of transform and filter appears at the end of this section).
import pandas as pd
# Create dataframe
data = {'Company':['GOOGLE','GOOGLE','MICROSOFT','MICROSOFT','FACEBOOK','FACEBOOK'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df

Output:

Company   Person  Sales
0 GOOGLE Sam 200
1 GOOGLE Charlie 120
2 MICROSOFT Amy 340
3 MICROSOFT Vanessa 124
4 FACEBOOK Carl 243
5 FACEBOOK Sarah 350

Now you can use the .groupby() method to group rows together based off of a column name. For instance let's group based off of Company. This will create a DataFrameGroupBy object:

df.groupby('Company')

Output:

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10283790>

You can save this object as a new variable:

by_comp = df.groupby("Company")

And then call aggregate methods on the object:

by_comp.mean()

Output:

Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0
by_comp.aggregate(np.mean)

Output:

Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0
by_comp.aggregate([np.max,np.min,np.mean, np.std])

Output:

Sales                        
amax amin mean std
Company
FACEBOOK 350 243 296.5 75.660426
GOOGLE 200 120 160.0 56.568542
MICROSOFT 340 124 232.0 152.735065
df.groupby('Company').mean()

Output:

Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0

More examples of aggregate methods:

by_comp.std()

Output:

Sales
Company
FACEBOOK 75.660426
GOOGLE 56.568542
MICROSOFT 152.735065
by_comp.min()

Output:

Person  Sales
Company
FACEBOOK Carl 243
GOOGLE Charlie 120
MICROSOFT Amy 124
by_comp.max()

Output:

Person  Sales
Company
FACEBOOK Sarah 350
GOOGLE Sam 200
MICROSOFT Vanessa 340
by_comp.count()

Output:

Person  Sales
Company
FACEBOOK 2 2
GOOGLE 2 2
MICROSOFT 2 2
# to get all above

by_comp.describe()

Output:

Sales                                                        
count mean std min 25% 50% 75% max
Company
FACEBOOK 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOGLE 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MICROSOFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
by_comp.describe().transpose()

Output:

Company        FACEBOOK      GOOGLE   MICROSOFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
by_comp.describe().transpose()['GOOGLE']

Output:

Sales  count      2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOGLE, dtype: float64
# transpose can also be done using just .T

by_comp.describe().T['GOOGLE']

Output:

Sales  count      2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOGLE, dtype: float64
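The bullets at the start of this section also mention transform() and filter(); here is a minimal sketch using the same Company/Sales DataFrame (df and by_comp as defined above):

by_comp['Sales'].transform('mean')  # group mean broadcast back to the original row positions
df.groupby('Company').filter(lambda g: g['Sales'].sum() > 400)  # keeps only the FACEBOOK and MICROSOFT rows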

(2g) Merging, Joining and Concatenating

  • There are three ways of combining two DataFrames: merging, joining, and concatenating.

Let's create some DataFrame objects first to illustrate them:

import pandas as pd

df1 = pd.DataFrame( {'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3'], 'C':['C0','C1','C2','C3'],
'D' : ['D0','D1','D2','D3']})

df2 = pd.DataFrame( {'A':['A4','A5','A6','A7'], 'B':['B4','B5','B6','B7'], 'C':['C4','C5','C6','C7'],
'D' : ['D4','D5','D6','D7']})


df3 = pd.DataFrame( {'A':['A8','A9','A10','A11'], 'B':['B8','B9','B10','B11'], 'C':['C8','C9','C10','C11'],
'D' : ['D8','D9','D10','D11']})
print(df1)
print('----------')

print(df2)
print('----------')

print(df3)
print('----------')

Output:

A   B   C   D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
----------
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
----------
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
----------

Concatenation using pd.concat()

  • Concatenation basically glues DataFrames together. Keep in mind that dimensions should match along the axis you are concatenating on.
df_concat_rows = pd.concat([df1,df2,df3]) # pass any sequence either list,tuple or dict
df_concat_cols = pd.concat([df1,df2,df3],axis=1)
df_concat_rows

Output:

A    B    C    D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
df_concat_cols

Output:

A   B   C   D   A   B   C   D    A    B    C    D
0 A0 B0 C0 D0 A4 B4 C4 D4 A8 B8 C8 D8
1 A1 B1 C1 D1 A5 B5 C5 D5 A9 B9 C9 D9
2 A2 B2 C2 D2 A6 B6 C6 D6 A10 B10 C10 D10
3 A3 B3 C3 D3 A7 B7 C7 D7 A11 B11 C11 D11

Merging using pd.merge()

  • Merging is done based on some key (a column) available in both DataFrames.
  • It can be of four types; the default is inner, the others are left, right, and outer.
import pandas as pd

df1 = pd.DataFrame({'KEY':['K0','K1','K2','K3'], 'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3']})

df2 = pd.DataFrame( {'KEY':['K0','K1','K2','K3'], 'A':['A4','A5','A6','A7'], 'B':['B4','B5','B6','B7']})
df1

Output:

KEY   A   B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
df2

Output:

KEY   A   B
0 K0 A4 B4
1 K1 A5 B5
2 K2 A6 B6
3 K3 A7 B7
pd.merge(df1,df2, on='KEY')

Output:

KEY A_x B_x A_y B_y
0 K0 A0 B0 A4 B4
1 K1 A1 B1 A5 B5
2 K2 A2 B2 A6 B6
3 K3 A3 B3 A7 B7
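Because df1 and df2 share the non-key columns A and B, pandas disambiguates them with the default _x/_y suffixes; these can be customized with the suffixes parameter (a small sketch on the same frames):

pd.merge(df1, df2, on='KEY', suffixes=('_left', '_right'))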
# more complicated examples - let's see all four types of merge

df1 = pd.DataFrame({'KEY1':['K0','K0','K1','K2'],'KEY2':['K0','K1','K0','K1'], 'A': ['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
df2 = pd.DataFrame({'KEY1':['K0','K1','K1','K2'],'KEY2':['K0','K0','K0','K0'], 'A':['A4','A5','A6','A7'],
'B':['B4','B5','B6','B7']})
df1

Output:

KEY1 KEY2   A   B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
df2

Output:

KEY1 KEY2   A   B
0 K0 K0 A4 B4
1 K1 K0 A5 B5
2 K1 K0 A6 B6
3 K2 K0 A7 B7
pd.merge(df1,df2,on=['KEY1','KEY2']) # can you figure out how this is done? Search for it if not!

# Hint
# inner is intersection of (key1,key2) of both DF's

Output:

KEY1 KEY2 A_x B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K1 K0 A2 B2 A5 B5
2 K1 K0 A2 B2 A6 B6
pd.merge(df1,df2,on=['KEY1','KEY2'], how='outer') # outer is union

Output:

KEY1 KEY2  A_x  B_x  A_y  B_y
0 K0 K0 A0 B0 A4 B4
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 A5 B5
3 K1 K0 A2 B2 A6 B6
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN A7 B7
pd.merge(df1,df2,on=['KEY1','KEY2'], how='left')

Output:

KEY1 KEY2 A_x B_x  A_y  B_y
0 K0 K0 A0 B0 A4 B4
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 A5 B5
3 K1 K0 A2 B2 A6 B6
4 K2 K1 A3 B3 NaN NaN
pd.merge(df1,df2,on=['KEY1','KEY2'], how='right')

Output:

KEY1 KEY2  A_x  B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K1 K0 A2 B2 A5 B5
2 K1 K0 A2 B2 A6 B6
3 K2 K0 NaN NaN A7 B7

Joining using join()

  • Joining combines the columns of two differently-indexed DataFrames (based on the index rather than on columns, as in merge) and is called on a DataFrame object.
  • It also comes in four types, the default this time being left.
df1 = pd.DataFrame(data = {'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3']}, index=['K0','K1','K2','K3'] )
df2 = pd.DataFrame(data = {'C': ['A4','A5','A6','A7'], 'D':['B4','B5','B6','B7']}, index=['K0','K2','K2','K3'])
df1

Output:

A   B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
df2

Output:

C   D
K0 A4 B4
K2 A5 B5
K2 A6 B6
K3 A7 B7
df1.join(df2, how ='left')

Output:

A   B    C    D
K0 A0 B0 A4 B4
K1 A1 B1 NaN NaN
K2 A2 B2 A5 B5
K2 A2 B2 A6 B6
K3 A3 B3 A7 B7
df1.join(df2,how='right')

Output:

A   B   C   D
K0 A0 B0 A4 B4
K2 A2 B2 A5 B5
K2 A2 B2 A6 B6
K3 A3 B3 A7 B7
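As with merge, join() also accepts how='inner' and how='outer'; a quick sketch using df1 and df2 from above:

df1.join(df2, how='inner')  # only index labels present in both frames
df1.join(df2, how='outer')  # union of index labels; missing values become NaN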

(2h) Operations

There are lots of pandas operations that will be really useful to you but don't fall into any distinct category. Let's show them here:

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

Info about Unique Values

df['col2'].unique()

Output:

array([444, 555, 666], dtype=int64)
df['col2'].nunique()

Output:

3
df['col2'].value_counts()

Output:

444    2
555 1
666 1
Name: col2, dtype: int64

Selecting Data

#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf

Output:

col1  col2 col3
3 4 444 xyz

Applying Functions

def times2(x):
    return x*2
df['col1'].apply(times2)

Output:

0    2
1 4
2 6
3 8
Name: col1, dtype: int64
df['col3'].apply(len)

Output:

0    3
1 3
2 3
3 3
Name: col3, dtype: int64
df['col1'].sum()

Output:

10
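apply() also works with anonymous lambda functions, which is often more convenient than defining a named helper; a short sketch on the same df:

df['col1'].apply(lambda x: x*2)        # same result as times2 above
df['col3'].apply(lambda s: s.upper())  # string transformation, e.g. 'abc' -> 'ABC'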

Permanently Removing a Column

del df['col1']
df

Output:

col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Get column and index names:

df.columns

Output:

Index(['col2', 'col3'], dtype='object')
df.index

Output:

RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame:

df

Output:

col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
df.sort_values(by='col2') #inplace=False by default

Output:

col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi

Find Null Values or Check for Null Values

df.isnull()

Output:

col2   col3
0 False False
1 False False
2 False False
3 False False
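In practice isnull() is usually chained with sum() to count missing values; a quick sketch (this df has no missing values, so every count is 0):

df.isnull().sum()        # number of NaN values in each column
df.isnull().sum().sum()  # total number of NaN values in the whole DataFrame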

Data Input and Output

This notebook is the reference code for getting data in and out; pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common data types:

import numpy as np
import pandas as pd
df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

CSV

CSV Output

df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
df.to_csv('example',index =False)

CSV Input

df = pd.read_csv('example')
df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
Excel Output

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
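Excel Input

Reading the file back uses pd.read_excel (depending on your pandas version this may require the openpyxl or xlrd package); a minimal sketch assuming the file written above. Because to_excel was called without index=False, the saved index comes back as an extra column unless you pass index_col=0:

df_excel = pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1', index_col=0)
df_excel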

Pandas Plotting

DataFrame and Series objects also have a plot() method; the kind parameter selects the type of graph (bar, hist, line, etc.).

%matplotlib inline
df

Output:

E         F         G         H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004
df.plot(kind='bar')

Output:

<matplotlib.axes._subplots.AxesSubplot at 0x102abe10>

[Bar plot of the DataFrame's columns]
df.plot(kind='hist')

Output:

<matplotlib.axes._subplots.AxesSubplot at 0x10313cb0>

[Histogram of the DataFrame's columns]
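Other kind values work the same way; a small sketch assuming the same df and an active matplotlib backend (kind='kde' additionally requires scipy):

df.plot(kind='line')                   # one line per column
df['E'].plot(kind='kde')               # kernel density estimate of a single column
df.plot(kind='scatter', x='E', y='F')  # scatter plots need explicit x and y columns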