Python – What is a Dataframe?


A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns, exactly as you would see in a classic Excel spreadsheet. In other terms, a Pandas DataFrame is nothing but an in-memory representation of an Excel sheet in the Python programming language.

Just like an Excel worksheet, Pandas DataFrame provides various functionalities to analyze, change, and extract valuable information from the given dataset.

Features of DataFrame

  • Columns can potentially be of different types
  • Size: mutable
  • Labeled axes (rows and columns)
  • Can perform arithmetic operations on rows and columns (see the example below)

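A minimal sketch of these features is the following (the column names and values are made up for illustration):

# import pandas library as pd
import pandas as pd

# columns can hold different types (strings, floats, integers)
df = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [10.5, 20.0, 7.25],
    'Quantity': [3, 1, 4]
})

# arithmetic on columns: add a computed column
df['Total'] = df['Price'] * df['Quantity']

# labeled axes: rows have an index, columns have names
print(df.dtypes)
print(df)
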
Now we will see not only several ways to create a Pandas dataframe, but also some common ways to manipulate and analyse the columns and rows of an imported dataframe, along with two common scenarios: importing CSV and Excel data into dataframes.

Create Empty Dataframe

In the real world, a Pandas DataFrame is usually created by loading datasets from persistent storage, including but not limited to Excel files, CSV (comma-separated values) files and SQL databases such as MySQL.

To create and manipulate a dataframe in Python we will use the pandas software library, which is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language and used for many data analysis and data science related applications.

Let's begin by considering an empty dataframe object in Python (we can use Jupyter Notebooks, a professional open-source platform for interactive computing across dozens of programming languages). We can create it with the following code:

# import pandas library as pd 

import pandas as pd 
  
# create an Empty DataFrame object 

df = pd.DataFrame() 
  
print(df) 

The following code adds columns, with values for each column, to the df dataframe object we just defined:

# Append columns to the Empty DataFrame

df['UserName'] = ['First', 'Second', 'Third']
df['SurName'] = ['Fourth', 'Fifth', 'Sixth']
df['NickName'] = ['Seventh', 'Eighth', 'Ninth']
print(df)

Create Empty Dataframe with only columns

The following code lets us create an empty DataFrame with column names only.

# import pandas library as pd 

import pandas as pd 
  
# create an Empty DataFrame 
# object With column names only 

df = pd.DataFrame(columns = ['First', 'Second', 'Third']) 
print(df)

Now, if we want to append some rows to this empty structure, we can follow the approach below, using the append() method on the dataframe object we already declared:

df = df.append({'First': 1, 'Second': 'Alfa', 'Third': 'Login'}, ignore_index=True)
df = df.append({'First': 2, 'Second': 'Beta', 'Third': 'Logout'}, ignore_index=True)
df = df.append({'First': 3, 'Second': 'Gamma', 'Third': 'Login'}, ignore_index=True)

The ignore_index=True option means that the data is not aligned on the joining axis; the resulting axis will simply be labeled 0, 1, ..., n - 1.

The definition of the append() method is the following (with the default values of its options shown):

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)
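Note that DataFrame.append() has been deprecated since pandas 1.4 and removed in pandas 2.0; on recent versions the same result can be obtained with pd.concat(). A minimal sketch, using the same example rows as above:

# Equivalent approach with pd.concat (works on pandas 2.x,
# where DataFrame.append() is no longer available)
import pandas as pd

df = pd.DataFrame(columns=['First', 'Second', 'Third'])

new_rows = pd.DataFrame([
    {'First': 1, 'Second': 'Alfa', 'Third': 'Login'},
    {'First': 2, 'Second': 'Beta', 'Third': 'Logout'},
    {'First': 3, 'Second': 'Gamma', 'Third': 'Login'}
])

df = pd.concat([df, new_rows], ignore_index=True)
print(df)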

Create Empty Dataframe with only column names and row indices

The following approach will define only the column headers and row indices. Here we pass the columns and index arguments to the DataFrame constructor, but without the data argument, so it will create an empty dataframe with all values set to NaN.

# Create an empty Dataframe with columns or indices

df = pd.DataFrame(columns=['First', 'Second', 'Third'], index=['0', '1', '2'])
print("Empty Dataframe", df, sep='\n')

Now let's add rows to an empty dataframe at existing indices:

df.loc['0'] = [23, 'First', 'Login']
df.loc['1'] = [24, 'Second', 'Logout']
df.loc['2'] = [25, 'Third', 'Login']
print("Dataframe Contents ", df, sep='\n')

Load a CSV with no headers

Another way to create a dataframe is by importing Excel or CSV files using Python. We can load a CSV file with no header by using the read_csv() Pandas function and specifying the header=None option. Let's see that in action.

data = pd.read_csv('data.csv', header=None)
data
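With header=None, pandas labels the columns with integers 0, 1, 2, and so on. If needed, meaningful names can be assigned afterwards; a small sketch, assuming the file has the six columns used in the next example:

# assign column names after loading (assumes the file has six columns)
data.columns = ['City', 'Edition', 'Sport', 'NOC', 'Gender', 'Medal']
print(data.head())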

Load a CSV specifying column names

If we have a CSV file without a header row (or we want to override the existing header), we can assign our own column names while loading it by passing them to the names parameter. See the code below.

data = pd.read_csv('data.csv', names=['City', 'Edition', 'Sport', 'NOC', 'Gender', 'Medal'])
data

The above code loads the data with the specified column names. Note that the names parameter only labels the columns; to import just a subset of the columns, use the usecols parameter described later.

Load a CSV specifying full path

You can also read a CSV file from its absolute path. See the example below:

# read csv using absolute path
import pandas as pd
df = pd.read_csv(r"C:\Users\ExampleUser\Example.csv")

Load a CSV from URL

You can also read a CSV file from its URL. Pass the URL to the read_csv() function and it’ll read the corresponding file to a dataframe.

import pandas as pd
df1 = pd.read_csv("https://example.com/example.csv")
df2 = pd.read_csv("https://example.com/example.data")

Load a CSV file and give custom column names

You can give custom column names to your dataframe when reading a CSV file using the read_csv() function. Pass your custom column names as a list to the names parameter.

import pandas as pd
df = pd.read_csv("https://example.com/example.csv",
                 names = ['First', 'Second', 'Third', 'Fourth', 'Fifth'])

Load CSV with a column as index

You can also use a column as the row labels of the dataframe. Pass the column name to the index_col parameter. Going back to the Example.csv file used earlier, here we use the Id column as the dataframe index. You can also pass multiple columns as a list to the index_col parameter to be used as the row index, as shown in the sketch after the example below.

# read csv with a column as index
import pandas as pd
df = pd.read_csv('Example.csv', index_col='Id')
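As a sketch of the multi-column case, passing a list to index_col builds a MultiIndex from those columns (the 'Name' column below is hypothetical and only used for illustration):

# read csv using two columns as a MultiIndex
# (assumes Example.csv also contains a 'Name' column)
import pandas as pd
df = pd.read_csv('Example.csv', index_col=['Id', 'Name'])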

Load only a subset of columns of a CSV

You can also specify the subset of columns to read from the dataset. Pass the subset of columns you want as a list to the usecols parameter. For example, let's read all the columns from Example.csv except the Id column.

# read only a subset of columns
import pandas as pd
df = pd.read_csv('Example.csv', usecols=['First', 'Second', 'Third', 'Fourth', 'Fifth'])

Load only the first n rows of a CSV

You can also specify the number of rows of a file to read using the nrows parameter to the read_csv() function. Particularly useful when you want to read a small segment of a large file.

# read only the first n rows
import pandas as pd
df = pd.read_csv('Example.csv', nrows=3)
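For very large files, another option worth knowing is the chunksize parameter, which makes read_csv() return an iterator of smaller dataframes instead of loading the whole file at once; a minimal sketch:

# read the CSV in chunks of 1000 rows at a time
import pandas as pd
for chunk in pd.read_csv('Example.csv', chunksize=1000):
    print(chunk.shape)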

Load an Excel file

NOTE: pandas also provides pd.ExcelFile('.../data/example.xls'), which is equivalent to calling read_excel(ExcelFile, ...) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelFile.parse.html).
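A minimal sketch of this pattern (the file path is illustrative; reading legacy .xls files may also require the xlrd package):

# open the workbook once, then parse individual sheets
import pandas as pd

xls = pd.ExcelFile('data/example.xls')
print(xls.sheet_names)      # list the available worksheets
df = xls.parse(0)           # parse the first sheet into a DataFrame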

We can read an Excel file using the read_excel() method of pandas to directly create a dataframe:

# import pandas lib as pd 
import pandas as pd 
  
# read by default 1st sheet of an excel file 
dataframe1 = pd.read_excel('Sample.xlsx')  

We can also specify which worksheet of the Excel file to import by using the sheet_name parameter of the read_excel() method:

# import pandas lib as pd 
import pandas as pd 
  
# read 2nd sheet of an excel file 
dataframe2 = pd.read_excel('Sample.xlsx', sheet_name = 1) 

We can specify which columns to import using usecols parameter of read_excel() method:

# import pandas lib as pd 
import pandas as pd 
  
require_cols = [0, 3] 
  
# only read specific columns from an excel file 
required_df = pd.read_excel('SampleWork.xlsx', usecols = require_cols) 

Load an Excel file by handling Not Available (N/A) values

A useful step when importing an Excel file is handling missing data with the na_values parameter of the read_excel() method; in the following example we will import only the third sheet of an Excel file (sheet_name=2, since sheets are zero-indexed), and any cell containing the text string "Missing" will be read as NaN (a missing value):

# import pandas lib as pd 
import pandas as pd 
  
# Handling missing values of 3rd sheet of an excel file. 
dataframe = pd.read_excel('Sample.xlsx', na_values = "Missing", sheet_name = 2) 

Load an Excel file and skip starting rows

Skip starting rows when reading an Excel file by using skiprows parameter of read_excel() method:

# import pandas lib as pd 
import pandas as pd 
  
# read 2nd sheet of an excel file after 
# skipping starting two rows  
df = pd.read_excel('Sample.xlsx', sheet_name = 1, skiprows = 2) 

Load an Excel file and read multiple specific sheets

We can read multiple Excel sheets by using sheet_name parameter of the read_excel() method:

# import pandas lib as pd 
import pandas as pd 
  
# read both 1st and 2nd sheet. 
df = pd.read_excel('Sample.xlsx', na_values = "Mssing", sheet_name =[0, 1]) 

Load an Excel file and read all sheets

Let's read now all sheets of the Excel file together by using sheet_name parameter of the read_excel() method:

# import pandas lib as pd 
import pandas as pd 
  
# read all sheets together. 
all_sheets_df = pd.read_excel('Sample.xlsx', na_values = "Missing",  sheet_name = None) 
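With sheet_name=None, read_excel() returns a dictionary that maps each sheet name to its dataframe, so the individual sheets can be accessed or iterated over, for example:

# all_sheets_df is a dict: {sheet name -> DataFrame}
for sheet_name, sheet_df in all_sheets_df.items():
    print(sheet_name, sheet_df.shape)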

Get the number of rows and number of columns in Pandas Dataframe

Pandas provides data analysts with a variety of pre-defined functions and attributes to get the number of rows and columns in a dataframe. A first way is by using the .axes attribute: axes returns a list containing the row axis and the column axis, so axes[0] holds the row labels, axes[1] holds the column labels, and their lengths give the number of rows and columns.

# import pandas library 

import pandas as pd 
    
# dictionary with list object in values 

details = { 
    'Name' : ['First', 'Second', 'Third', 'Tenth'], 
    'Age' : [23, 21, 22, 21], 
    'University' : ['BHU', 'JNU', 'DU', 'BHU'], 
} 
    
# creating a Dataframe object  

df = pd.DataFrame(details, columns = ['Name', 'Age', 'University'], 
                  index = ['a', 'b', 'c', 'd']) 
    
# Get the number of rows and columns 

rows = len(df.axes[0]) 
cols = len(df.axes[1]) 
  
# Print the number of rows and columns 

print("Number of Rows: " + str(rows)) 
print("Number of Columns: " + str(cols)) 

An alternative way to get these info is by using the df.info() method that provides all the information about the dataframe, including the number of rows and columns:

# import pandas library 

import pandas as pd 
    
# dictionary with list object in values 

details = { 
    'Name' : ['First', 'Second', 'Third', 'Tenth'], 
    'Age' : [23, 21, 22, 21], 
    'University' : ['BHU', 'JNU', 'DU', 'BHU'], 
} 
    
# creating a Dataframe object  

df = pd.DataFrame(details, columns = ['Name', 'Age', 'University'], 
                  index = ['a', 'b', 'c', 'd']) 
    
# Get the info of data frame 

df.info() 

If we want to get only the number of rows and/or the number of columns, we can use the built-in len() function. In the following example, len(df) returns, by default, the number of rows of the specified dataframe (df):

# import pandas library 

import pandas as pd 
    
# dictionary with list object in values 

details = { 
    'Name' : ['First', 'Second', 'Third', 'Tenth'], 
    'Age' : [23, 21, 22, 21], 
    'University' : ['BHU', 'JNU', 'DU', 'BHU'], 
} 
    
# creating a Dataframe object  

df = pd.DataFrame(details, columns = ['Name', 'Age', 'University'], 
                  index = ['a', 'b', 'c', 'd']) 
    
# Get the number of rows 

print("Number of Rows:", len(df)) 

The following code, instead, will show only the number of columns of the same dataframe, using len(df.columns):

# import pandas library 

import pandas as pd 
    
# dictionary with list object in values 

details = { 
    'Name' : ['First', 'Second', 'Third', 'Tenth'], 
    'Age' : [23, 21, 22, 21], 
    'University' : ['BHU', 'JNU', 'DU', 'BHU'], 
} 
    
# creating a Dataframe object  

df = pd.DataFrame(details, columns = ['Name', 'Age', 'University'], 
                  index = ['a', 'b', 'c', 'd']) 
    
# Get the number of columns

print("Number of Columns:", len(df.columns)) 

We can also use the dataframe.shape attribute to get the count of rows and columns: dataframe.shape[0] and dataframe.shape[1] give the count of rows and columns respectively.

# importing the module 
import pandas as pd 
  
# creating a DataFrame 
dict = {'Name' : ['Martha', 'Tim', 'Rob', 'Georgia'], 
        'Marks' : [87, 91, 97, 95]} 
df = pd.DataFrame(dict) 
  
# displaying the DataFrame (display() works in Jupyter; use print(df) in a plain script) 
display(df) 
  
# fetching the number of rows and columns 
rows = df.shape[0] 
cols = df.shape[1] 
  
# displaying the number of rows and columns 
print("Rows: " + str(rows)) 
print("Columns: " + str(cols)) 

Check whether dataframe is empty

In Python’s pandas, the Dataframe class provides an attribute empty: Dataframe.empty. It returns True if the Dataframe contains no data.

# Create an empty Dataframe

dfObj = pd.DataFrame(columns=['Date', 'UserName', 'Action'])

Now let’s check if it’s empty using the empty attribute:

# Check if Dataframe is empty using empty attribute

if dfObj.empty == True:
    print('DataFrame is empty')
else:
    print('DataFrame is not empty')

Output:
DataFrame is empty

If the dataframe contains only NaN values, the empty attribute will still return False:

import numpy as np
import pandas as pd

# List of Tuples

students = [(np.NaN, np.NaN, np.NaN),
            (np.NaN, np.NaN, np.NaN),
            (np.NaN, np.NaN, np.NaN)
           ]
# Create a DataFrame object

studentsDfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City'])
# Check if Dataframe is empty using empty attribute

if studentsDfObj.empty == True:
    print('Student DataFrame is empty')
else:
    print('Student DataFrame is not empty')

The output of the above will be: Student DataFrame is not empty.

The dataframe contains only NaN values, but it still holds entries, therefore it is not empty as far as the empty attribute is concerned.
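If you also want to treat a dataframe holding only NaN values as empty, one possible approach (a sketch, not the only way) is to drop the all-NaN rows before checking the empty attribute:

# Drop rows in which every value is NaN, then check empty
if studentsDfObj.dropna(how='all').empty:
    print('Student DataFrame contains no actual data')
else:
    print('Student DataFrame contains some data')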

We can achieve the same goal, checking if a dataframe is empty, also by using Dataframe.shape, which returns a tuple containing the dimensions of the Dataframe.

For example, if our dataframe has 3 rows and 4 columns it will return (3, 4). If our dataframe is empty, the value at index 0 of the tuple (the count of rows) will be 0. So, we can check if the dataframe is empty by checking whether the value at index 0 of this tuple is 0.

# Create an empty Dataframe
dfObj = pd.DataFrame(columns=['Date', 'UserName', 'Action'])
# Check if Dataframe is empty using dataframe's shape attribute
if dfObj.shape[0] == 0:
    print('DataFrame is empty')
else:
    print('DataFrame is not empty')

The output of the above will be: DataFrame is empty.

Another way to check if a dataframe is empty is by using Dataframe.index, which represents the index of the Dataframe; if the dataframe is empty, its size will be 0:

# Create an empty Dataframe
dfObj = pd.DataFrame(columns=['Date', 'UserName', 'Action'])
# check if length of index is 0 to verify if dataframe is empty
if len(dfObj.index.values) == 0:
    print('DataFrame is empty')
else:
    print('DataFrame is not empty')

The output of the above will be: DataFrame is empty.

Finally, we can check if a dataframe is empty by using len() on a dataframe object as follows:

# Check if length of dataframe is 0 by calling len() on the Dataframe
if len(dfObj) == 0:
    print('DataFrame is empty')
else:
    print('DataFrame is not empty')

The output of the above will be: DataFrame is empty.

Export a DataFrame to the CSV File

In the final part of this introductory article we will see how to export a pandas dataframe to the CSV file format. The most basic way to do this is the following form (where df is the dataframe object we want to export):

df.to_csv(r'{FULL_PATH}\FILE_NAME.csv')

Now, a working example would be the following, in which we exclude the index during the export but include the header:

# Import pandas library first
from pandas import DataFrame

# Sample data as a dictionary
Sample = {'Name': ['First','Second','Third','Fourth'],
            'Age': [20,21,21,20],
            'Score': [89,78,94,56]
           }

# Create DataFrame
df = DataFrame(Sample, columns= ['Name', 'Age', 'Score'])

# Export DataFrame to CSV File
df.to_csv(r'C:\Users\ExampleUser\Desktop\Sample.csv', index=False, header=True)

The output file is exported to the desktop (C:\Users\ExampleUser\Desktop\Sample.csv). We can also use a relative path instead of a full path when specifying the target file location; with a relative path, the file will be saved in the current folder/directory in which you are executing the Python commands.

An additional note about the path we just used:

r'C:\Users\ExampleUser\Desktop\Sample.csv'

Note the following:

  • The r prefix should be placed before the path string so that it is treated as a raw string and the backslashes in the path are not interpreted as escape sequences; alternatives to the r prefix are shown in the sketch after this list. Without it, you’ll get the following error: (unicode error) 'unicodeescape' codec can’t decode bytes in position 2-3: truncated \UXXXXXXXX escape.
  • Sample represents the file name to be created. You may type a different file name if you’d like.
  • .csv represents the file type, which is CSV. You must add this extension anytime you want to export your DataFrame to a CSV file. Alternatively, you may use the txt file type if you want to export your DataFrame to a text file.
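Two common alternatives to the r prefix are shown below (the paths are illustrative): forward slashes, which Python accepts in Windows paths, or doubled backslashes.

# forward slashes also work for Windows paths in Python
df.to_csv('C:/Users/ExampleUser/Desktop/Sample.csv', index=False)

# escaped (doubled) backslashes are equivalent
df.to_csv('C:\\Users\\ExampleUser\\Desktop\\Sample.csv', index=False)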

Export a DataFrame to the Excel File

Finally, let's consider how to export a pandas dataframe to the Microsoft Excel file format. We can achieve this by using the to_excel() method, which relies on the openpyxl library, a Python library to read and write Excel xlsx/xlsm/xltx/xltm files.

The openpyxl library can be installed by using pip (pip is a package-management system written in Python used to install and manage software packages). If the openpyxl library is missing while using the to_excel() method, you will get ModuleNotFoundError: No module named 'openpyxl'. Internally, to_excel() uses xlwt to write legacy .xls files (formats up to Excel 2003) and openpyxl to write .xlsx files (Excel 2007 and later formats); note that xlwt support has been deprecated in recent pandas versions.

pip install openpyxl

A basic example of the to_excel() method is the following:

df.to_excel(r'Path to store the exported excel file\Sample.xlsx', index = False)

Considering a complete example, including the creation of a dataframe directly in Python, the following code will show a working usage of the to_excel() method:

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
        'Price': [32000,35000,37000,45000]
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
df.to_excel(r'C:\Users\ExampleUser\Desktop\Export.xlsx', index=False, header=True)

Multiple sheets may be written by specifying a unique sheet_name for each dataframe. When using a with block, the changes are saved automatically when the block ends. IMPORTANT NOTE: creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased (unless the file is opened in append mode, as shown in the next section).

If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter object (here we consider two different dataframes, containing different data, called df1 and df2, to be written to the same Excel workbook):

with pd.ExcelWriter('Output.xlsx') as writer:  
    df1.to_excel(writer, sheet_name='Sheet_name_1')
    df2.to_excel(writer, sheet_name='Sheet_name_2')

Append to an existing Excel file

You can append a DataFrame to an existing Excel file. The code below opens an existing file, then adds two sheets with the data of the dataframes.

Note: because the file is processed using openpyxl, only .xlsx files are supported.

# pandas uses openpyxl under the hood, so openpyxl must be installed
import pandas as pd

path = 'pandas_to_excel.xlsx'

# open the existing workbook in append mode so its sheets are preserved
with pd.ExcelWriter(path, mode='a', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='new_sheet1')
    df2.to_excel(writer, sheet_name='new_sheet2')

Best Practices For Spreadsheet Data

Before working with tabular data, it is a good practice to view a sample of the available data and check whether the following points hold for the file you plan to work with (to avoid wasting time later while working on the full dataset):

  • The first row of the spreadsheet is usually reserved for the header, which describes what each column's data represents (unless the spreadsheet contains raw data such as image pixels).
  • Avoid column headers with blank spaces or names made up of multiple words separated by spaces. Prefer conventions such as underscores or camel case (where the first letter of each word is capitalized).
  • Prefer short names over long names or sentences, and try to avoid names that contain special characters such as ?, $, %, ^, etc., since special characters do not tell anything about the data.
  • Your data might have missing values in some columns. Make sure to fill those with NA or with the mean or median of the complete column.

While working with Microsoft Excel, you will find a considerable amount of options to save your file. Besides the default extension .xls or .xlsx, you can go to the File tab, click on Save As and select one of the extensions that are listed as the Save as Type file extension options.

The most commonly used extensions to save datasets for data science are .csv and .txt (as a tab-delimited text file), and even .xml. Depending on the saving option you choose, your dataset's fields are separated by tabs or commas, which make up the field separator characters of your dataset. Knowing the extension of your file is important because, when you load data stored in or exported from Excel, your Python library needs to be told explicitly whether it is a comma-separated or tab-separated file.
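For example, a tab-delimited export can be loaded by passing the separator explicitly to read_csv(); a small sketch, assuming a file named example.txt saved as tab-delimited text:

# read a tab-delimited text file into a dataframe
import pandas as pd
df_tab = pd.read_csv('example.txt', sep='\t')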

