Source: View original notebook on GitHub
Category: Machine Learning / Learn ML libraries

Pandas
- Basically used to do work like in excel.(You can think of pandas as an extremely powerful version of Excel, with a lot more features.)
- Pandas(Panel data)
Topics discussed:-
- Introduction to Pandas and its Features
- Series
- DataFrames
- selecting rows and columns (using
[]orloc/iloc attribute) - Adding and removing Rows (using
[]to add anddrop()to remove) - Conditional statements on DataFrames
- Setting and Resetting Indexes (use
set_index()andreset_index()) - Missing Data (use
dropna()to clean data andfillna()to fill missing data) - GroupBy (
groupby()) - convert data to numpy(using
df.values) - Merging,Joining,and Concatenating(
merge(),join(),concat()) - Operations (
unique(),nunique(),value_counts(),isnull(),sort_values(),apply()column,indexetc)
- selecting rows and columns (using
- Data Input and Output
- Pandas Plotting
Features
- pandas - a powerful data analysis and manipulation library for Python
- pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.
- It aims to be the fundamental high-level building block for doing practical,real worlddata analysis in Python.
Main Features
Here are just a few of the things that pandas does well:
Easy handling of missing datain floating point as well as non-floating point data.Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and letSeries,DataFrame, etc. automatically align the data for you in computations.Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.Intuitive merging and joining data sets.Flexible reshaping and pivoting of data sets.Robust IO tools for loading data from flat files(CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format.
Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
import pandas as pd
import numpy as np
Series and DataFrames are the main modules of Pandas Module we will use
1. pd.Series
- basically used to create a hashtable with key,value pair namely (index,data).
- A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn't need to hold numeric data, it can hold any arbitrary Python Object.
- it can hold any type of data even built_in_functions as well.
import numpy as np
import pandas as pd
data = np.array([1,2,3,4])
label = np.array([0,1,2,3])
label_char = np.array(['shaurya', 'priyansh', 'mamta', 'mukesh'])
hash_table = pd.Series(data) # By default indices will be 0,1,2,3 and so on...
# providing numbered-indexing as well
hash_table2 = pd.Series(data = data, index = label)
# providing custom-indexing as well
hash_table3 = pd.Series(data = data, index = label_char)
# providing dictionary d
d= {'shaurya':1, 'singhal':2, 'mukesh':3, 'mamta':4}
hash_table4 = pd.Series(d)
daata = list(zip(label_char, data))
hash_table5 = pd.Series(daata)
print(hash_table)
print('----------------')
print(hash_table2)
print('----------------')
print(hash_table3)
print('----------------')
print(hash_table4)
print('----------------')
print(hash_table5)
print('----------------')
Output:
0 1
1 2
2 3
3 4
dtype: int32
----------------
0 1
1 2
2 3
3 4
dtype: int32
----------------
shaurya 1
priyansh 2
mamta 3
mukesh 4
dtype: int32
----------------
shaurya 1
singhal 2
mukesh 3
mamta 4
dtype: int64
----------------
0 (shaurya, 1)
1 (priyansh, 2)
2 (mamta, 3)
3 (mukesh, 4)
dtype: object
----------------
# saving functions
hash_fun = pd.Series(data = [sum,min,max,print,len])
hash_fun
Output:
0 <built-in function sum>
1 <built-in function min>
2 <built-in function max>
3 <built-in function print>
4 <built-in function len>
dtype: object
Operations are then also done based off of index:
ser1 = pd.Series([1,2,3,4], index = ['USA', 'Germany','USSR', 'Japan'])
ser2 = pd.Series([1,2,5,4], index = ['USA', 'Germany','Italy', 'Japan'])
ser1 + ser2
Output:
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
ser1
Output:
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2
Output:
USA 1
Germany 2
Italy 5
Japan 4
dtype: int64
pd.Series?
2. DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language.
We can think of a DataFrame as a bunch of Series objects put together to share the same index(combine various hashtables with same keys or indices). Let's use pandas to explore this topic!we will show DataFrames being Series of Series and infact built on numpy ARRAYS
- DataFrames have (index,data,columns) in the order data,index,columns
import pandas as pd
from numpy.random import randn
df = pd.DataFrame(data = randn(4,4), index=['A','B','C','D'], columns=['E','F','G','H'])
df
Output:
E F G H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
data = np.arange(5)
daata = list(zip(['s','p','ma','mu'], data))
print(daata)
df2 = pd.DataFrame(daata,index = ['A','B','C','D'])
Output:
[('s', 0), ('p', 1), ('ma', 2), ('mu', 3)]
df2
Output:
0 1
A s 0
B p 1
C ma 2
D mu 3
df
Output:
E F G H
A 0.072135 1.072180 -0.498480 1.228543
B -1.726358 -0.774394 -1.335940 0.078463
C 1.068696 0.198596 -0.986149 -0.622723
D -0.007756 -0.481936 -1.610555 0.667122
(2a)Selecting/Indexing rows(use loc/iloc attribute) and colums(use []) of DataFrame
# if we look and select E-th column use:
print(df['E']) # selecting column of DataFrame
print('------')
print(type(df['E']))
# it is a pandas - Series
Output:
A 0.072135
B -1.726358
C 1.068696
D -0.007756
Name: E, dtype: float64
------
<class 'pandas.core.series.Series'>
# selecting row
# df['A'] # this is an error
ref = df.loc
print(ref['A'])
print('---------')
type(ref['A'])
# this is also a pandas- Series
Output:
E 0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64
---------
pandas.core.series.Series
# above thing can be combined
print(df.loc['A'])
Output:
E 0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64
# can use iloc to select based on numerical index starting with 0 (no Surprise here :| )
print(df.iloc[0]) # same as df.loc['A']
Output:
E 0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64
Conclusion
- DataFrames are basically Series of Series
- since Series are based on numpy ,DatatFrames are as well. lets call shape attribute to it
df.shape
Output:
(4, 4)
# we can select multiple columns/rows (pass list of columns/rows)as well ex-
df[['E','F']]
Output:
E F
A 0.072135 1.072180
B -1.726358 -0.774394
C 1.068696 0.198596
D -0.007756 -0.481936
df[['E','F']].loc[['C','D']]
Output:
E F
C 1.068696 0.198596
D -0.007756 -0.481936
# or use iloc
df[['E','F']].iloc[[2,3]]
Output:
E F
C 1.068696 0.198596
D -0.007756 -0.481936
# grabbing individual element
df.loc['A','E']
Output:
0.07213467134028595
Converting DataFrame into numpy arrays (use values attribute on DataFrame object)
# converting DataFrame into numpy arrays
df = pd.DataFrame(data = randn(4,4))
print(type(df))
print()
print(df)
print('======================')
print()
df = df.values
print(type(df))
print(df)
print('======================')
Output:
<class 'pandas.core.frame.DataFrame'>
0 1 2 3
0 0.208281 -0.738167 0.269264 0.028646
1 -1.445427 -0.562245 -2.903674 -0.087284
2 -1.328404 -0.255863 -0.023696 -0.346428
3 -0.737738 -0.947259 -1.132884 -0.592254
======================
<class 'numpy.ndarray'>
[[ 0.20828142 -0.73816711 0.26926379 0.0286463 ]
[-1.44542694 -0.56224521 -2.90367424 -0.08728398]
[-1.32840381 -0.25586251 -0.02369633 -0.34642762]
[-0.73773775 -0.94725893 -1.13288378 -0.59225421]]
======================
(2b)Adding and Removing colums/rows in DataFrames
# Adding
df
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df['new'] = df['E'] + df['F'] # remember hashtable(map in c++) ,this add new column to table
df
Output:
E F G H new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
df.loc['new_row'] = df.loc['A'] + df.loc['B'] # adding row using loc/iloc attribute
df
Output:
E F G H new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
# removing -> not inplace or do it by setting that inplace attribute True
df.drop('A', axis=0) # set axis to 0 for rows
# df.drop('E',axis=1) # set axis to 1 for columms
Output:
E F G H new
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
df # see nothing is dropped ->??
Output:
E F G H new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
# to Drop set inplace =True
df.drop('new',axis=1, inplace=True)
df # now new is removed
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
new_row -1.634011 -1.508054 -0.995743 0.375528
# lets remove new_row as well
df.drop('new_row', inplace=True, axis=0)
df # back to Square One!!
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
(2c) Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
df
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df>0
Output:
E F G H
A True False False False
B False False True True
C False False True False
D True True False False
df[df>0]
Output:
E F G H
A 0.107055 NaN NaN NaN
B NaN NaN 0.111722 0.444997
C NaN NaN 0.882894 NaN
D 0.991462 1.046418 NaN NaN
df[df<1]
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 NaN -1.461521 -0.365615
df
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df[df['E']>0] # will select all columns where in E-th columns values are >0
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
D 0.991462 1.046418 -1.461521 -0.365615
# to select only F and G th column where Eth values are >0
df[df['E']>0][['F','G']] # why? --->>> since df[df['E']>0] this is also a Dataframes we can chain concept learn till now.
Output:
F G
A -0.856872 -1.107465
D 1.046418 -1.461521
For two conditions you can use | and & with parenthesis :(dont use and or as they are use for single elements
not on a series/array of elements)
# to show my point
df['E']>0
Output:
A True
B False
C False
D True
Name: E, dtype: bool
# to show my point
df['F']>1
Output:
A False
B False
C False
D True
Name: F, dtype: bool
# df[(df['E']>0) and (df['F'] > 1)] # is an error -> The truth value of a Series is ambiguous.
df[(df['E']>0) & (df['F'] > 1)] # only Dth column is returned
Output:
E F G H
D 0.991462 1.046418 -1.461521 -0.365615
# similarly
# df[(df['E']>0) or (df['F'] > 1)] # error
df[(df['E']>0) | (df['F'] > 1)]
Output:
E F G H
A 0.107055 -0.856872 -1.107465 -0.069469
D 0.991462 1.046418 -1.461521 -0.365615
(2d) Index setting(using set_index(value)) and Resetting(using reset_index())
df
Output:
E F G H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
# Reset to default 0,1...n index
df.reset_index() # also use inplace=True to make effect permanent
Output:
index E F G H
0 A 0.211382 0.277118 -0.069357 1.751781
1 B -1.107231 0.458712 0.533907 1.065361
2 C 0.230110 2.127683 0.612683 -0.938353
3 D -1.991244 0.719935 1.462831 -1.112004
df # showing resetting was not inplace
Output:
E F G H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
newind = 'CA NY WY OR'.split()
df['States'] = newind # adding new col
df
Output:
E F G H States
A 0.211382 0.277118 -0.069357 1.751781 CA
B -1.107231 0.458712 0.533907 1.065361 NY
C 0.230110 2.127683 0.612683 -0.938353 WY
D -1.991244 0.719935 1.462831 -1.112004 OR
df.set_index('States') # setting different index !!! remenber it overwrittes original indexes!!! also by default not inplace
Output:
E F G H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004
df
Output:
E F G H States
A 0.107055 -0.856872 -1.107465 -0.069469 CA
B -1.741066 -0.651183 0.111722 0.444997 NY
C -1.150753 -0.368783 0.882894 -1.017265 WY
D 0.991462 1.046418 -1.461521 -0.365615 OR
df.set_index('States',inplace=True)
df
Output:
E F G H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004
(2e) Missing Data( use dropna() and fillna()) use inplace=True to make changes Permanent
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan], 'B':[5,np.nan,np.nan],'C':[1,2,3]}) # passing dict to Dataframe make keys its column
# indexes and data being each cell data with rows indexed using 0,1,2,3,4
df
Output:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.dropna() # drops row(bydefault axis = 0) with even a single NaN
Output:
A B C
0 1.0 5.0 1
df.dropna(axis=1)
Output:
C
0 1
1 2
2 3
df.dropna(thresh=2) # to set minimum no of NaN's to be found to drop that row(axis=0) or col(axis=1)
Output:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
df.fillna(value='FILL VALUE')
Output:
A B C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
df # not doing it inplace till now use inplace=True to do that
Output:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df['A']
Output:
0 1.0
1 2.0
2 NaN
Name: A, dtype: float64
df['A'].mean()
Output:
1.5
df['A'].fillna(value=df['A'].mean())
Output:
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
(2f) Groupby
The groupby() method allows you to group rows of data together and call aggregate functions
<img src = 'images/groupby_func.jpg' width=200>
- use
aggregate()function to apply function over group by object to collect groups - use
transform(funct) - use
filter(funct)
import pandas as pd
# Create dataframe
data = {'Company':['GOOGLE','GOOGLE','MICROSOFT','MICROSOFT','FACEBOOK','FACEBOOK'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
Output:
Company Person Sales
0 GOOGLE Sam 200
1 GOOGLE Charlie 120
2 MICROSOFT Amy 340
3 MICROSOFT Vanessa 124
4 FACEBOOK Carl 243
5 FACEBOOK Sarah 350
Now you can use the .groupby() method to group rows together based off of a column name. For instance let's group based off of Company. This will create a DataFrameGroupBy object:
df.groupby('Company')
Output:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10283790>
You can save this object as a new variable:
by_comp = df.groupby("Company")
And then call aggregate methods off the object :
by_comp.mean()
Output:
Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0
by_comp.aggregate(np.mean)
Output:
Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0
by_comp.aggregate([np.max,np.min,np.mean, np.std])
Output:
Sales
amax amin mean std
Company
FACEBOOK 350 243 296.5 75.660426
GOOGLE 200 120 160.0 56.568542
MICROSOFT 340 124 232.0 152.735065
df.groupby('Company').mean()
Output:
Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0
More examples of aggregate methods:
by_comp.std()
Output:
Sales
Company
FACEBOOK 75.660426
GOOGLE 56.568542
MICROSOFT 152.735065
by_comp.min()
Output:
Person Sales
Company
FACEBOOK Carl 243
GOOGLE Charlie 120
MICROSOFT Amy 124
by_comp.max()
Output:
Person Sales
Company
FACEBOOK Sarah 350
GOOGLE Sam 200
MICROSOFT Vanessa 340
by_comp.count()
Output:
Person Sales
Company
FACEBOOK 2 2
GOOGLE 2 2
MICROSOFT 2 2
# to get all above
by_comp.describe()
Output:
Sales
count mean std min 25% 50% 75% max
Company
FACEBOOK 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOGLE 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MICROSOFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
by_comp.describe().transpose()
Output:
Company FACEBOOK GOOGLE MICROSOFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
by_comp.describe().transpose()['GOOGLE']
Output:
Sales count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOGLE, dtype: float64
# transpose can br done using just T
by_comp.describe().T['GOOGLE']
Output:
Sales count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOGLE, dtype: float64
(2g) Mergeing ,Joining and Concatenation
- There are three ways of combining two DataFrames namely merging ,joining ,concatenating.
Lets crete dataFrames object first to illustrate it:
import pandas as pd
df1 = pd.DataFrame( {'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3'], 'C':['C0','C1','C2','C3'],
'D' : ['D0','D1','D2','D3']})
df2 = pd.DataFrame( {'A':['A4','A5','A6','A7'], 'B':['B4','B5','B6','B7'], 'C':['C4','C5','C6','C7'],
'D' : ['D4','D5','D6','D7']})
df3 = pd.DataFrame( {'A':['A8','A9','A10','A11'], 'B':['B8','B9','B10','B11'], 'C':['C8','C9','C10','C11'],
'D' : ['D8','D9','D10','D11']})
print(df1)
print('----------')
print(df2)
print('----------')
print(df3)
print('----------')
Output:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
----------
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
----------
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
----------
Concatenation using pd.concat()
- Concatination basically glues together DataFrames.Keep in mind that dimensions should match along the axis u are performing concatenation
df_concat_rows = pd.concat([df1,df2,df3]) # pass any sequence either list,tuple or dict
df_concat_cols = pd.concat([df1,df2,df3],axis=1)
df_concat_rows
Output:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
df_concat_cols
Output:
A B C D A B C D A B C D
0 A0 B0 C0 D0 A4 B4 C4 D4 A8 B8 C8 D8
1 A1 B1 C1 D1 A5 B5 C5 D5 A9 B9 C9 D9
2 A2 B2 C2 D2 A6 B6 C6 D6 A10 B10 C10 D10
3 A3 B3 C3 D3 A7 B7 C7 D7 A11 B11 C11 D11
Merging use pd.merge()
- merging is done based on some key(column) available in both mergeable DataFrames
- can be of four type ,default is inner ; others are left ,right,outer.
import pandas as pd
df1 = pd.DataFrame({'KEY':['K0','K1','K2','K3'], 'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3']})
df2 = pd.DataFrame( {'KEY':['K0','K1','K2','K3'], 'A':['A4','A5','A6','A7'], 'B':['B4','B5','B6','B7']})
df1
Output:
KEY A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
df2
Output:
KEY A B
0 K0 A4 B4
1 K1 A5 B5
2 K2 A6 B6
3 K3 A7 B7
pd.merge(df1,df2, on='KEY')
Output:
KEY A_x B_x A_y B_y
0 K0 A0 B0 A4 B4
1 K1 A1 B1 A5 B5
2 K2 A2 B2 A6 B6
3 K3 A3 B3 A7 B7
#complicated Examples - lets see all four type of merge
df1 = pd.DataFrame({'KEY1':['K0','K0','K1','K2'],'KEY2':['K0','K1','K0','K1'], 'A': ['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
df2 = pd.DataFrame({'KEY1':['K0','K1','K1','K2'],'KEY2':['K0','K0','K0','K0'], 'A':['A4','A5','A6','A7'],
'B':['B4','B5','B6','B7']})
df1
Output:
KEY1 KEY2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
df2
Output:
KEY1 KEY2 A B
0 K0 K0 A4 B4
1 K1 K0 A5 B5
2 K1 K0 A6 B6
3 K2 K0 A7 B7
pd.merge(df1,df2,on=['KEY1','KEY2']) # can u find out how it is done :/ search on google if not!
# Hint
# inner is intersection of (key1,key2) of both DF's
Output:
KEY1 KEY2 A_x B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K1 K0 A2 B2 A5 B5
2 K1 K0 A2 B2 A6 B6
pd.merge(df1,df2,on=['KEY1','KEY2'], how='outer') # outer is union
Output:
KEY1 KEY2 A_x B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 A5 B5
3 K1 K0 A2 B2 A6 B6
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN A7 B7
pd.merge(df1,df2,on=['KEY1','KEY2'], how='left')
Output:
KEY1 KEY2 A_x B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 A5 B5
3 K1 K0 A2 B2 A6 B6
4 K2 K1 A3 B3 NaN NaN
pd.merge(df1,df2,on=['KEY1','KEY2'], how='right')
Output:
KEY1 KEY2 A_x B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K1 K0 A2 B2 A5 B5
2 K1 K0 A2 B2 A6 B6
3 K2 K0 NaN NaN A7 B7
Joining using join()
- joining is basically combining columns over differently - index DataFrames(not based on columns as in merge) and is called upon DataFrame object.
- also of four tyoe default being left this time
df1 = pd.DataFrame(data = {'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3']}, index=['K0','K1','K2','K3'] )
df2 = pd.DataFrame(data = {'C': ['A4','A5','A6','A7'], 'D':['B4','B5','B6','B7']}, index=['K0','K2','K2','K3'])
df1
Output:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
df2
Output:
C D
K0 A4 B4
K2 A5 B5
K2 A6 B6
K3 A7 B7
df1.join(df2, how ='left')
Output:
A B C D
K0 A0 B0 A4 B4
K1 A1 B1 NaN NaN
K2 A2 B2 A5 B5
K2 A2 B2 A6 B6
K3 A3 B3 A7 B7
df1.join(df2,how='right')
Output:
A B C D
K0 A0 B0 A4 B4
K2 A2 B2 A5 B5
K2 A2 B2 A6 B6
K3 A3 B3 A7 B7
(2h) Operations
There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
Info about Unique Values
df['col2'].unique()
Output:
array([444, 555, 666], dtype=int64)
df['col2'].nunique()
Output:
3
df['col2'].value_counts()
Output:
444 2
555 1
666 1
Name: col2, dtype: int64
Selecting Data
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
Output:
col1 col2 col3
3 4 444 xyz
Applying Functions
def times2(x):
return x*2
df['col1'].apply(times2)
Output:
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
df['col3'].apply(len)
Output:
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
df['col1'].sum()
Output:
10
Permanently Removing a Column
del df['col1']
df
Output:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
Get column and index names:
df.columns
Output:
Index(['col2', 'col3'], dtype='object')
df.index
Output:
RangeIndex(start=0, stop=4, step=1)
Sorting and Ordering a DataFrame:
df
Output:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
df.sort_values(by='col2') #inplace=False by default
Output:
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi
Find Null Values or Check for Null Values
df.isnull()
Output:
col2 col3
0 False False
1 False False
2 False False
3 False False
Data Input and Output
This notebook is the reference code for getting input and output, pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common data types:
import numpy as np
import pandas as pd
df
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
CSV
CSV Output
df
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
df.to_csv('example',index =False)
CSV input
df = pd.read_csv('example')
df
Output:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
DataFrame/Series object also have plot function to plot values with kind parameter to represent different types of graphs
%matplotlib inline
df
Output:
E F G H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004
df.plot(kind='bar')
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x102abe10>
df.plot(kind='hist')
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x10313cb0>
