
Pandas

Shaurya Singhal · 23 min read

Source: View original notebook on GitHub

Category: Machine Learning / Learn ML libraries


Pandas

  • You can think of pandas as an extremely powerful version of Excel, with many more features.
  • The name pandas comes from "panel data".

Topics discussed:

  • Introduction to Pandas and its Features
  • Series
  • DataFrames
    • Selecting rows and columns (using [] or the loc/iloc attributes)
    • Adding and removing rows (using [] to add and drop() to remove)
    • Conditional statements on DataFrames
    • Setting and resetting indexes (using set_index() and reset_index())
    • Missing data (using dropna() to drop and fillna() to fill missing values)
    • GroupBy (groupby())
    • Converting data to NumPy (using df.values)
    • Merging, joining, and concatenating (merge(), join(), concat())
    • Operations (unique(), nunique(), value_counts(), isnull(), sort_values(), apply(), columns, index, etc.)
  • Data Input and Output
  • Pandas Plotting

Features

  • pandas - a powerful data analysis and manipulation library for Python
  • pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.
  • It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Main Features

Here are just a few of the things that pandas does well:

  • Easy handling of missing data in floating point as well as non-floating point data.
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations.
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
  • Intuitive merging and joining data sets.
  • Flexible reshaping and pivoting of data sets.
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format.
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc. (a short sketch follows this list).
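To make the time-series bullet concrete, here is a minimal sketch (not from the original notebook) of date-range generation, a moving-window statistic, and frequency conversion; the dates and values are invented for illustration:

import pandas as pd
import numpy as np

dates = pd.date_range(start='2020-01-01', periods=30, freq='D')  # hypothetical month of daily dates
ts = pd.Series(np.random.randn(30), index=dates)                 # random daily values
ts.rolling(window=7).mean()                                      # 7-day moving-window average
ts.resample('W').mean()                                          # frequency conversion: daily -> weekly means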
import pandas as pd
import numpy as np

Series and DataFrame are the two main pandas data structures we will use.

1. pd.Series

  • A Series is essentially a hash table mapping keys to values, namely (index, data) pairs.
  • A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
  • It can hold any type of data, even built-in functions.
import numpy as np
import pandas as pd
data = np.array([1,2,3,4])
label = np.array([0,1,2,3])
label_char = np.array(['shaurya', 'priyansh', 'mamta', 'mukesh'])
hash_table = pd.Series(data) # By default indices will be 0,1,2,3 and so on...

# providing numbered-indexing as well
hash_table2 = pd.Series(data = data, index = label)


# providing custom-indexing as well
hash_table3 = pd.Series(data = data, index = label_char)


# providing dictionary d
d= {'shaurya':1, 'singhal':2, 'mukesh':3, 'mamta':4}
hash_table4 = pd.Series(d)

daata = list(zip(label_char, data))
hash_table5 = pd.Series(daata)
print(hash_table)
print('----------------')
print(hash_table2)
print('----------------')
print(hash_table3)
print('----------------')
print(hash_table4)
print('----------------')
print(hash_table5)
print('----------------')

Output:

0    1
1 2
2 3
3 4
dtype: int32
----------------
0 1
1 2
2 3
3 4
dtype: int32
----------------
shaurya 1
priyansh 2
mamta 3
mukesh 4
dtype: int32
----------------
shaurya 1
singhal 2
mukesh 3
mamta 4
dtype: int64
----------------
0 (shaurya, 1)
1 (priyansh, 2)
2 (mamta, 3)
3 (mukesh, 4)
dtype: object
----------------
# saving functions
hash_fun = pd.Series(data = [sum,min,max,print,len])
hash_fun

Output:

0      <built-in function sum>
1 <built-in function min>
2 <built-in function max>
3 <built-in function print>
4 <built-in function len>
dtype: object

Operations are then also done based on the index:

ser1 = pd.Series([1,2,3,4], index = ['USA', 'Germany','USSR', 'Japan'])
ser2 = pd.Series([1,2,5,4], index = ['USA', 'Germany','Italy', 'Japan'])
ser1 + ser2

Output:

Germany    4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
ser1

Output:

USA        1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2

Output:

USA        1
Germany 2
Italy 5
Japan 4
dtype: int64
# pd.Series? shows the docstring and constructor signature in a notebook
pd.Series?
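Values in a Series can also be looked up by label, much like a dictionary; a quick sketch using ser1 and ser2 defined above:

ser1['USA']            # label-based lookup -> 1
ser1[['USA','Japan']]  # a list of labels returns a sub-Series
ser2['Italy']          # -> 5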

2. DataFrames

  • DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index (i.e. several hash tables combined over the same keys). Let's use pandas to explore this topic!

  • We will see that a DataFrame behaves like a Series of Series and is, in fact, built on NumPy arrays.

  • The DataFrame constructor takes data, index, and columns (in that order).
import pandas as pd
from numpy.random import randn
df = pd.DataFrame(data = randn(4,4), index=['A','B','C','D'], columns=['E','F','G','H'])
df

Output:

E         F         G         H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
data  = np.arange(5)
daata = list(zip(['s','p','ma','mu'], data))
print(daata)
df2 = pd.DataFrame(daata,index = ['A','B','C','D'])

Output:

[('s', 0), ('p', 1), ('ma', 2), ('mu', 3)]
df2

Output:

0  1
A s 0
B p 1
C ma 2
D mu 3
df

Output:

E         F         G         H
A 0.072135 1.072180 -0.498480 1.228543
B -1.726358 -0.774394 -1.335940 0.078463
C 1.068696 0.198596 -0.986149 -0.622723
D -0.007756 -0.481936 -1.610555 0.667122

(2a) Selecting/Indexing rows (use the loc/iloc attributes) and columns (use []) of a DataFrame

# if we look and select E-th column use:
print(df['E']) # selecting column of DataFrame
print('------')
print(type(df['E']))
# it is a pandas - Series

Output:

A    0.072135
B -1.726358
C 1.068696
D -0.007756
Name: E, dtype: float64
------
<class 'pandas.core.series.Series'>
# selecting row
# df['A'] # this is an error

ref = df.loc
print(ref['A'])
print('---------')
type(ref['A'])
# this is also a pandas- Series

Output:

E    0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64
---------
pandas.core.series.Series
# above thing can be combined
print(df.loc['A'])

Output:

E    0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64
# can use iloc to select based on numerical index starting with 0 (no Surprise here :| )
print(df.iloc[0]) # same as df.loc['A']

Output:

E    0.072135
F 1.072180
G -0.498480
H 1.228543
Name: A, dtype: float64

Conclusion

  • DataFrames are basically a Series of Series.
  • Since Series are based on NumPy, DataFrames are as well. Let's call the shape attribute on one:
df.shape

Output:

(4, 4)
# we can select multiple columns/rows as well by passing a list of columns/rows, e.g.
df[['E','F']]

Output:

E         F
A 0.072135 1.072180
B -1.726358 -0.774394
C 1.068696 0.198596
D -0.007756 -0.481936
df[['E','F']].loc[['C','D']]

Output:

E         F
C 1.068696 0.198596
D -0.007756 -0.481936
# or use iloc
df[['E','F']].iloc[[2,3]]

Output:

E         F
C 1.068696 0.198596
D -0.007756 -0.481936
# grabbing individual element
df.loc['A','E']

Output:

0.07213467134028595
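loc also accepts label slices and lists for both axes at once, and iloc does the same with integer positions; a small sketch on the same df (note that label slices include the end label, while integer slices do not):

df.loc['A':'C', ['E','F']]  # rows A through C (inclusive), columns E and F
df.iloc[0:2, 0:2]           # first two rows and first two columns (end-exclusive)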

Converting a DataFrame into a NumPy array (use the values attribute on the DataFrame object)

# converting DataFrame into numpy arrays
df = pd.DataFrame(data = randn(4,4))
print(type(df))
print()
print(df)
print('======================')
print()
df = df.values
print(type(df))
print(df)
print('======================')

Output:

<class 'pandas.core.frame.DataFrame'>

0 1 2 3
0 0.208281 -0.738167 0.269264 0.028646
1 -1.445427 -0.562245 -2.903674 -0.087284
2 -1.328404 -0.255863 -0.023696 -0.346428
3 -0.737738 -0.947259 -1.132884 -0.592254
======================

<class 'numpy.ndarray'>
[[ 0.20828142 -0.73816711 0.26926379 0.0286463 ]
[-1.44542694 -0.56224521 -2.90367424 -0.08728398]
[-1.32840381 -0.25586251 -0.02369633 -0.34642762]
[-0.73773775 -0.94725893 -1.13288378 -0.59225421]]
======================

(2b) Adding and Removing columns/rows in DataFrames

# Adding
df

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df['new'] = df['E'] + df['F'] # remember the hash table (map in C++) analogy: this adds a new column to the table
df

Output:

E         F         G         H       new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
df.loc['new_row'] = df.loc['A'] + df.loc['B'] # adding row using loc/iloc attribute
df

Output:

E         F         G         H       new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
# removing -> not in place by default; set inplace=True to make it permanent
df.drop('A', axis=0) # set axis to 0 for rows
# df.drop('E',axis=1) # set axis to 1 for columns

Output:

E         F         G         H       new
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
df # see nothing is dropped ->??

Output:

E         F         G         H       new
A 0.107055 -0.856872 -1.107465 -0.069469 -0.749817
B -1.741066 -0.651183 0.111722 0.444997 -2.392248
C -1.150753 -0.368783 0.882894 -1.017265 -1.519535
D 0.991462 1.046418 -1.461521 -0.365615 2.037880
new_row -1.634011 -1.508054 -0.995743 0.375528 -3.142065
# to Drop set inplace =True
df.drop('new',axis=1, inplace=True)
df # now new is removed

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
new_row -1.634011 -1.508054 -0.995743 0.375528
# let's remove new_row as well
df.drop('new_row', inplace=True, axis=0)
df # back to Square One!!

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615

(2c) Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

df

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df>0

Output:

E      F      G      H
A True False False False
B False False True True
C False False True False
D True True False False
df[df>0]

Output:

E         F         G         H
A 0.107055 NaN NaN NaN
B NaN NaN 0.111722 0.444997
C NaN NaN 0.882894 NaN
D 0.991462 1.046418 NaN NaN
df[df<1]

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 NaN -1.461521 -0.365615
df

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
B -1.741066 -0.651183 0.111722 0.444997
C -1.150753 -0.368783 0.882894 -1.017265
D 0.991462 1.046418 -1.461521 -0.365615
df[df['E']>0] # selects all rows where the value in column E is > 0

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
D 0.991462 1.046418 -1.461521 -0.365615
# to select only the F and G columns where the E values are > 0
df[df['E']>0][['F','G']] # why does this work? df[df['E']>0] is itself a DataFrame, so we can chain everything learned so far

Output:

F         G
A -0.856872 -1.107465
D 1.046418 -1.461521

For two conditions you can use | and & with parentheses (don't use the Python keywords and/or, as they work on single boolean values, not on a Series/array of elements):

# to show my point
df['E']>0

Output:

A     True
B False
C False
D True
Name: E, dtype: bool
# to show my point
df['F']>1

Output:

A    False
B False
C False
D True
Name: F, dtype: bool
# df[(df['E']>0) and (df['F'] > 1)] # is an error -> The truth value of a Series is ambiguous.  
df[(df['E']>0) & (df['F'] > 1)] # only row D is returned

Output:

E         F         G         H
D 0.991462 1.046418 -1.461521 -0.365615
# similarly
# df[(df['E']>0) or (df['F'] > 1)] # error
df[(df['E']>0) | (df['F'] > 1)]

Output:

E         F         G         H
A 0.107055 -0.856872 -1.107465 -0.069469
D 0.991462 1.046418 -1.461521 -0.365615
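Two related tricks not shown in the notebook: a condition can be negated with ~, and Series.between() checks membership in an interval; a quick sketch on the same df:

df[~(df['E']>0)]           # rows where E is NOT greater than 0
df[df['E'].between(-1, 1)] # rows where E lies in the closed interval [-1, 1]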

(2d) Index Setting (using set_index(value)) and Resetting (using reset_index())

df

Output:

E         F         G         H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
# Reset to default 0,1...n index
df.reset_index() # also use inplace=True to make effect permanent

Output:

index         E         F         G         H
0 A 0.211382 0.277118 -0.069357 1.751781
1 B -1.107231 0.458712 0.533907 1.065361
2 C 0.230110 2.127683 0.612683 -0.938353
3 D -1.991244 0.719935 1.462831 -1.112004
df # showing resetting was not inplace

Output:

E         F         G         H
A 0.211382 0.277118 -0.069357 1.751781
B -1.107231 0.458712 0.533907 1.065361
C 0.230110 2.127683 0.612683 -0.938353
D -1.991244 0.719935 1.462831 -1.112004
newind = 'CA NY WY OR'.split()
df['States'] = newind # adding new col
df

Output:

E         F         G         H States
A 0.211382 0.277118 -0.069357 1.751781 CA
B -1.107231 0.458712 0.533907 1.065361 NY
C 0.230110 2.127683 0.612683 -0.938353 WY
D -1.991244 0.719935 1.462831 -1.112004 OR
df.set_index('States') # setting a different index. Remember: it overwrites the original index! Also not in place by default.

Output:

E         F         G         H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004
df

Output:

E         F         G         H States
A 0.107055 -0.856872 -1.107465 -0.069469 CA
B -1.741066 -0.651183 0.111722 0.444997 NY
C -1.150753 -0.368783 0.882894 -1.017265 WY
D 0.991462 1.046418 -1.461521 -0.365615 OR
df.set_index('States',inplace=True)
df

Output:

E         F         G         H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004

(2e) Missing Data (use dropna() and fillna(); use inplace=True to make changes permanent)

import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan], 'B':[5,np.nan,np.nan],'C':[1,2,3]}) # passing a dict to DataFrame makes the keys its column
# labels and the values the cell data, with rows indexed 0,1,2,...
df

Output:

A    B  C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.dropna() # drops any row (by default axis=0) containing even a single NaN

Output:

A    B  C
0 1.0 5.0 1
df.dropna(axis=1)

Output:

C
0 1
1 2
2 3
df.dropna(thresh=2) # thresh sets the minimum number of non-NaN values required to keep a row (axis=0) or column (axis=1)

Output:

A    B  C
0 1.0 5.0 1
1 2.0 NaN 2
df.fillna(value='FILL VALUE')

Output:

A           B  C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
df # changes so far were not in place; use inplace=True to make them permanent

Output:

A    B  C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df['A']

Output:

0    1.0
1 2.0
2 NaN
Name: A, dtype: float64
df['A'].mean()

Output:

1.5
df['A'].fillna(value=df['A'].mean())

Output:

0    1.0
1 2.0
2 1.5
Name: A, dtype: float64
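Instead of filling column by column, you can pass a Series of per-column values to fillna(), or forward-fill down each column; a minimal sketch on the same df (note that newer pandas versions prefer df.ffill() over method='ffill'):

df.fillna(value=df.mean())   # fill each numeric column with its own mean
df.fillna(method='ffill')    # propagate the last valid value forward down each column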

(2f) Groupby

The groupby() method allows you to group rows of data together and call aggregate functions

[Figure: groupby follows the split-apply-combine pattern]

  • Use the aggregate() function to apply a function over the GroupBy object and collapse each group to a single value.
  • Use transform(func) to get a result with the same shape as the original frame.
  • Use filter(func) to keep only the groups for which func returns True (a short sketch of transform and filter appears at the end of this section).
import pandas as pd
# Create dataframe
data = {'Company':['GOOGLE','GOOGLE','MICROSOFT','MICROSOFT','FACEBOOK','FACEBOOK'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df

Output:

Company   Person  Sales
0 GOOGLE Sam 200
1 GOOGLE Charlie 120
2 MICROSOFT Amy 340
3 MICROSOFT Vanessa 124
4 FACEBOOK Carl 243
5 FACEBOOK Sarah 350

Now you can use the .groupby() method to group rows together based off of a column name. For instance let's group based off of Company. This will create a DataFrameGroupBy object:

df.groupby('Company')

Output:

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10283790>

You can save this object as a new variable:

by_comp = df.groupby("Company")

And then call aggregate methods on the object:

by_comp.mean()

Output:

Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0
by_comp.aggregate(np.mean)

Output:

Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0
by_comp.aggregate([np.max,np.min,np.mean, np.std])

Output:

Sales                        
amax amin mean std
Company
FACEBOOK 350 243 296.5 75.660426
GOOGLE 200 120 160.0 56.568542
MICROSOFT 340 124 232.0 152.735065
df.groupby('Company').mean()

Output:

Sales
Company
FACEBOOK 296.5
GOOGLE 160.0
MICROSOFT 232.0

More examples of aggregate methods:

by_comp.std()

Output:

Sales
Company
FACEBOOK 75.660426
GOOGLE 56.568542
MICROSOFT 152.735065
by_comp.min()

Output:

Person  Sales
Company
FACEBOOK Carl 243
GOOGLE Charlie 120
MICROSOFT Amy 124
by_comp.max()

Output:

Person  Sales
Company
FACEBOOK Sarah 350
GOOGLE Sam 200
MICROSOFT Vanessa 340
by_comp.count()

Output:

Person  Sales
Company
FACEBOOK 2 2
GOOGLE 2 2
MICROSOFT 2 2
# to get all above

by_comp.describe()

Output:

Sales                                                        
count mean std min 25% 50% 75% max
Company
FACEBOOK 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOGLE 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MICROSOFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
by_comp.describe().transpose()

Output:

Company        FACEBOOK      GOOGLE   MICROSOFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
by_comp.describe().transpose()['GOOGLE']

Output:

Sales  count      2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOGLE, dtype: float64
# transpose can also be done using just .T

by_comp.describe().T['GOOGLE']

Output:

Sales  count      2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOGLE, dtype: float64
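The bullets at the start of this section also mention transform() and filter(); here is a minimal sketch using the same Company/Sales DataFrame (df and by_comp as defined above):

by_comp['Sales'].transform('mean')  # group mean broadcast back to the original row positions
df.groupby('Company').filter(lambda g: g['Sales'].sum() > 400)  # keeps only the FACEBOOK and MICROSOFT rows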

(2g) Merging, Joining and Concatenating

  • There are three ways of combining two DataFrames: merging, joining, and concatenating.

Let's create some DataFrame objects first to illustrate them:

import pandas as pd

df1 = pd.DataFrame( {'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3'], 'C':['C0','C1','C2','C3'],
'D' : ['D0','D1','D2','D3']})

df2 = pd.DataFrame( {'A':['A4','A5','A6','A7'], 'B':['B4','B5','B6','B7'], 'C':['C4','C5','C6','C7'],
'D' : ['D4','D5','D6','D7']})


df3 = pd.DataFrame( {'A':['A8','A9','A10','A11'], 'B':['B8','B9','B10','B11'], 'C':['C8','C9','C10','C11'],
'D' : ['D8','D9','D10','D11']})
print(df1)
print('----------')

print(df2)
print('----------')

print(df3)
print('----------')

Output:

A   B   C   D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
----------
A B C D
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
----------
A B C D
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
----------

Concatenation using pd.concat()

  • Concatenation basically glues DataFrames together. Keep in mind that dimensions should match along the axis you are concatenating on.
df_concat_rows = pd.concat([df1,df2,df3]) # pass any sequence either list,tuple or dict
df_concat_cols = pd.concat([df1,df2,df3],axis=1)
df_concat_rows

Output:

A    B    C    D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
0 A8 B8 C8 D8
1 A9 B9 C9 D9
2 A10 B10 C10 D10
3 A11 B11 C11 D11
df_concat_cols

Output:

A   B   C   D   A   B   C   D    A    B    C    D
0 A0 B0 C0 D0 A4 B4 C4 D4 A8 B8 C8 D8
1 A1 B1 C1 D1 A5 B5 C5 D5 A9 B9 C9 D9
2 A2 B2 C2 D2 A6 B6 C6 D6 A10 B10 C10 D10
3 A3 B3 C3 D3 A7 B7 C7 D7 A11 B11 C11 D11

Merging using pd.merge()

  • Merging is done based on some key (a column) available in both DataFrames.
  • It can be of four types; the default is inner, the others are left, right, and outer.
import pandas as pd

df1 = pd.DataFrame({'KEY':['K0','K1','K2','K3'], 'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3']})

df2 = pd.DataFrame( {'KEY':['K0','K1','K2','K3'], 'A':['A4','A5','A6','A7'], 'B':['B4','B5','B6','B7']})
df1

Output:

KEY   A   B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
df2

Output:

KEY   A   B
0 K0 A4 B4
1 K1 A5 B5
2 K2 A6 B6
3 K3 A7 B7
pd.merge(df1,df2, on='KEY')

Output:

KEY A_x B_x A_y B_y
0 K0 A0 B0 A4 B4
1 K1 A1 B1 A5 B5
2 K2 A2 B2 A6 B6
3 K3 A3 B3 A7 B7
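Because df1 and df2 share the non-key columns A and B, pandas disambiguates them with the default _x/_y suffixes; these can be customized with the suffixes parameter (a small sketch on the same frames):

pd.merge(df1, df2, on='KEY', suffixes=('_left', '_right'))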
# more complicated examples - let's see all four types of merge

df1 = pd.DataFrame({'KEY1':['K0','K0','K1','K2'],'KEY2':['K0','K1','K0','K1'], 'A': ['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
df2 = pd.DataFrame({'KEY1':['K0','K1','K1','K2'],'KEY2':['K0','K0','K0','K0'], 'A':['A4','A5','A6','A7'],
'B':['B4','B5','B6','B7']})
df1

Output:

KEY1 KEY2   A   B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
df2

Output:

KEY1 KEY2   A   B
0 K0 K0 A4 B4
1 K1 K0 A5 B5
2 K1 K0 A6 B6
3 K2 K0 A7 B7
pd.merge(df1,df2,on=['KEY1','KEY2']) # can you figure out how this is done? Search for it if not!

# Hint
# inner is intersection of (key1,key2) of both DF's

Output:

KEY1 KEY2 A_x B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K1 K0 A2 B2 A5 B5
2 K1 K0 A2 B2 A6 B6
pd.merge(df1,df2,on=['KEY1','KEY2'], how='outer') # outer is union

Output:

KEY1 KEY2  A_x  B_x  A_y  B_y
0 K0 K0 A0 B0 A4 B4
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 A5 B5
3 K1 K0 A2 B2 A6 B6
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN A7 B7
pd.merge(df1,df2,on=['KEY1','KEY2'], how='left')

Output:

KEY1 KEY2 A_x B_x  A_y  B_y
0 K0 K0 A0 B0 A4 B4
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 A5 B5
3 K1 K0 A2 B2 A6 B6
4 K2 K1 A3 B3 NaN NaN
pd.merge(df1,df2,on=['KEY1','KEY2'], how='right')

Output:

KEY1 KEY2  A_x  B_x A_y B_y
0 K0 K0 A0 B0 A4 B4
1 K1 K0 A2 B2 A5 B5
2 K1 K0 A2 B2 A6 B6
3 K2 K0 NaN NaN A7 B7

Joining using join()

  • Joining combines the columns of two differently-indexed DataFrames (based on the index rather than on columns, as in merge) and is called on a DataFrame object.
  • It also comes in four types, the default this time being left.
df1 = pd.DataFrame(data = {'A': ['A0','A1','A2','A3'], 'B':['B0','B1','B2','B3']}, index=['K0','K1','K2','K3'] )
df2 = pd.DataFrame(data = {'C': ['A4','A5','A6','A7'], 'D':['B4','B5','B6','B7']}, index=['K0','K2','K2','K3'])
df1

Output:

A   B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
df2

Output:

C   D
K0 A4 B4
K2 A5 B5
K2 A6 B6
K3 A7 B7
df1.join(df2, how ='left')

Output:

A   B    C    D
K0 A0 B0 A4 B4
K1 A1 B1 NaN NaN
K2 A2 B2 A5 B5
K2 A2 B2 A6 B6
K3 A3 B3 A7 B7
df1.join(df2,how='right')

Output:

A   B   C   D
K0 A0 B0 A4 B4
K2 A2 B2 A5 B5
K2 A2 B2 A6 B6
K3 A3 B3 A7 B7
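As with merge, join() also accepts how='inner' and how='outer'; a quick sketch using df1 and df2 from above:

df1.join(df2, how='inner')  # only index labels present in both frames
df1.join(df2, how='outer')  # union of index labels; missing values become NaN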

(2h) Operations

There are lots of pandas operations that will be really useful to you but don't fall into any distinct category. Let's show them here:

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

Info about Unique Values

df['col2'].unique()

Output:

array([444, 555, 666], dtype=int64)
df['col2'].nunique()

Output:

3
df['col2'].value_counts()

Output:

444    2
555 1
666 1
Name: col2, dtype: int64

Selecting Data

#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf

Output:

col1  col2 col3
3 4 444 xyz

Applying Functions

def times2(x):
    return x*2
df['col1'].apply(times2)

Output:

0    2
1 4
2 6
3 8
Name: col1, dtype: int64
df['col3'].apply(len)

Output:

0    3
1 3
2 3
3 3
Name: col3, dtype: int64
df['col1'].sum()

Output:

10
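apply() also works with anonymous lambda functions, which is often more convenient than defining a named helper; a short sketch on the same df:

df['col1'].apply(lambda x: x*2)        # same result as times2 above
df['col3'].apply(lambda s: s.upper())  # string transformation, e.g. 'abc' -> 'ABC'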

Permanently Removing a Column

del df['col1']
df

Output:

col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Get column and index names:

df.columns

Output:

Index(['col2', 'col3'], dtype='object')
df.index

Output:

RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame:

df

Output:

col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
df.sort_values(by='col2') #inplace=False by default

Output:

col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi

Find Null Values or Check for Null Values

df.isnull()

Output:

col2   col3
0 False False
1 False False
2 False False
3 False False
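In practice isnull() is usually chained with sum() to count missing values; a quick sketch (this df has no missing values, so every count is 0):

df.isnull().sum()        # number of NaN values in each column
df.isnull().sum().sum()  # total number of NaN values in the whole DataFrame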

Data Input and Output

This notebook is the reference code for getting data in and out; pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common data types:

import numpy as np
import pandas as pd
df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

CSV

CSV Output

df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
df.to_csv('example',index =False)

CSV Input

df = pd.read_csv('example')
df

Output:

col1  col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
Excel Output

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
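Excel Input

Reading the file back uses pd.read_excel (depending on your pandas version this may require the openpyxl or xlrd package); a minimal sketch assuming the file written above. Because to_excel was called without index=False, the saved index comes back as an extra column unless you pass index_col=0:

df_excel = pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1', index_col=0)
df_excel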

Pandas Plotting

DataFrame and Series objects also have a plot() method; the kind parameter selects the type of graph (bar, hist, line, etc.).

%matplotlib inline
df

Output:

E         F         G         H
States
CA 0.211382 0.277118 -0.069357 1.751781
NY -1.107231 0.458712 0.533907 1.065361
WY 0.230110 2.127683 0.612683 -0.938353
OR -1.991244 0.719935 1.462831 -1.112004
df.plot(kind='bar')

Output:

<matplotlib.axes._subplots.AxesSubplot at 0x102abe10>

[Bar plot of the DataFrame's columns]
df.plot(kind='hist')

Output:

<matplotlib.axes._subplots.AxesSubplot at 0x10313cb0>

[Histogram of the DataFrame's columns]
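Other kind values work the same way; a small sketch assuming the same df and an active matplotlib backend (kind='kde' additionally requires scipy):

df.plot(kind='line')                   # one line per column
df['E'].plot(kind='kde')               # kernel density estimate of a single column
df.plot(kind='scatter', x='E', y='F')  # scatter plots need explicit x and y columns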