Python – Pandas – Converting XLSX to Dictionary (key and values)


To totally unlock this section you need to Log-in

During data import process in a Jupyter Notebook, using Python and Pandas module for data science, we would need to manipulate or import directly an Excel file into a notebook and transfor all the data into a dictionary, so in this article we will focus on this particular need.

Let's say we have an Excel file with four columns, City, Country, Population and Area: now that we have this example file in the folder we are working with, we can get this data into Python code in a few lines.

import pandas
travel_df = pandas.read_excel('./cities.xlsx')
cities = travel_df.to_dict('records')
cities[0]
{'City': 'Buenos Aires',
 'Country': 'Argentina',
 'Population': 2891,
 'Area': 203}

Press shift+enter to run the code.

The code above relies on using an outside library called pandas, as it's good at reading Excel files. A library is just a set of reusable functions. The pandas library is available for free online. We tell our current Jupyter notebook that we are about to use it with the line import pandas.

And that gives us an object, like a dictionary, which has a method in it called read_excel. Similar to how we can call {'foo': 'bar'}.keys(). That's the benefit of a library, we can get methods that do not come out of the box with Python.

So we use the read_excel data to read our excel file, by providing the name of the file, cities.xlsx, and the preceding ./ just indicates that the file is found in the current folder. Finally with the line travel_df.to_dict('records') we return a list of our dictionaries representing our data. You can see that when we access the first element of this list, it returns our first dictionary.

The below is the complete code, with output.

# Here we use a library, which is some code not part of standard Python, to make this process easier 
import pandas
# If we use the `import pandas` we have access to the pandas library 
travel_df = pandas.read_excel('./cities.xlsx')
# We call the pandas.read_excel method and pass through the string './cities.xlsx' as the file is called cities.xlsx.  By saying './' we are saying 
# go to the current folder, excel-to-python, and find the 'cities.xlsx' file there
cities = travel_df.to_dict('records')
cities
[{'City': 'Buenos Aires',
  'Country': 'Argentina',
  'Population': 2891,
  'Area': 203},
 {'City': 'Toronto', 'Country': 'Canada', 'Population': 2732, 'Area': 630},
 {'City': 'Marakesh', 'Country': 'Morocco', 'Population': 929, 'Area': 230},
 {'City': 'Albuquerque', 'Country': 'USA', 'Population': 559, 'Area': 491},
 {'City': 'Los Cabos', 'Country': 'Mexico', 'Population': 288, 'Area': 3751},
 {'City': 'Greenville', 'Country': 'USA', 'Population': 93, 'Area': 68},
 {'City': 'Archipelago Sea',
  'Country': 'Finland',
  'Population': 60,
  'Area': 2000},
 {'City': 'Pyeongchang',
  'Country': 'South Korea',
  'Population': 44,
  'Area': 1464},
 {'City': 'Walla Walla Valley',
  'Country': 'USA',
  'Population': 33,
  'Area': 35},
 {'City': 'Salina Island', 'Country': 'Italy', 'Population': 3, 'Area': 26},
 {'City': 'Solta', 'Country': 'Croatia', 'Population': 2, 'Area': 59},
 {'City': 'Iguazu Falls',
  'Country': 'Argentina',
  'Population': 0,
  'Area': 2396}]

Look at that. Our variable cities is full of cities from the spreadsheet.

Now that we have the data in Python, we can use a couple of other of functions to quickly explore our data. For example, let's say that we want to remind ourselves of all of the attributes associated with a single city. If we just look at the first element we see both the keys and values.

cities[0]
{'City': 'Buenos Aires',
 'Country': 'Argentina',
 'Population': 2891,
 'Area': 203}

But, we really only need to see the keys.

cities[0].keys()
dict_keys(['City', 'Country', 'Population', 'Area'])

Note that the keys() function returns a dict_keys object. It's a little tricky to work with that type of object, so let's coerce it into a list.

list(cities[0].keys())
['City', 'Country', 'Population', 'Area']

Much better.

If we would like to also see our values of a dictionary in a list, we can do something similar with the .values() function for a dictionary.

cities[0].values()
dict_values(['Buenos Aires', 'Argentina', 2891, 203])
list(cities[0].values())
['Buenos Aires', 'Argentina', 2891, 203]

Once again, we call the method, and then coerce it into a list, by using the list function.

Creating Dictionaries

So far, we have seen one way of creating dictionaries:

philadelphia = {'City': 'Philadelphia'}

We can do this even with another approach:

pittsburgh = dict(city='Pittsburgh')
pittsburgh
{'city': 'Pittsburgh'}

As you can see, by using the keyword dict, we can pass through the name of the key followed by the equal sign. Notice that the key name is provided as a string, but Python still knows how to handle it.

Let's do one more:

dict(city="Las Vegas", state="Nevada")
{'city': 'Las Vegas', 'state': 'Nevada'}

Alternative Method

Another easy way to convert a spreadsheet to Python dictionary is to use parse method, always from Pandas library, to get Excel data to a dataframe and finally pass it to the to_dict method to transform the dataframe into a dictionary. You can use these like in the following code:

from pandas import *
xls = ExcelFile('my_file.xls')
data = xls.parse(xls.sheet_names[0])
print(data.to_dict())

XLRD Method

To use the following solution (function/method) we will have to install, if not already available in your Jupyter Notebook installation, the xlrd library and then import it, like pandas.

We can use the following code to install xlrd library:

pip install xlrd

But, if we are using conda, let's use:

conda install -c anaconda xlrd

Below there is a function approach, using xlrd library, to import and transform Excel data into a dictionary:

import xlrd
def make_json_from_data(column_names, row_data):
    ### Take column names and row info and merge into a single json object.
    row_list = []
    for item in row_data:
        json_obj = {}
        for i in range(0, column_names.__len__()):
            json_obj[column_names[i]] = item[i]
        row_list.append(json_obj)
    return row_list

def xls_to_dict(workbook_url):
    ### Convert the read xls file into JSON.
    ### workbook_url: Fully Qualified URL of the xls file to be read.
    ### workbook_dict: json representation of the workbook.
    workbook_dict = {}
    book = xlrd.open_workbook(workbook_url)
    sheets = book.sheets()
    for sheet in sheets:
        if sheet.name == 'PortHoles & Discrete Appurtenan':
            continue
        workbook_dict[sheet.name] = {}
        columns = sheet.row_values(0)
        rows = []
        for row_index in range(1, sheet.nrows):
            row = sheet.row_values(row_index)
            rows.append(row)
        sheet_data = make_json_from_data(columns, rows)
        workbook_dict[sheet.name] = sheet_data
    return workbook_dict
    
# Sample Call: 
sample = xls_to_dict('/home/root/workbook.xls')
print(sample)