How To Import and Manipulate Large Datasets in Python Using Pandas
A getting-started guide to processing large datasets in Python
As a Python developer, you will often have to work with large datasets. Python is known for being a language that is well-suited to this task.
With that said, Python itself does not have much in the way of built-in capabilities for data analysis. Instead, data analysts make use of a Python library called pandas.
In this article, you will learn how to import and manipulate large datasets in Python using pandas.
How to Import Pandas
To use pandas in a Python script, you will first need to import it. It is convention to import pandas under the alias pd, like this:
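The import is a single line:

```python
# Import pandas under its conventional alias
import pandas as pd
```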
If pandas is not already installed on your machine, you will encounter an error.
Here is how you can install pandas at the command line using the pip package manager:
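Assuming pip is available on your PATH, the command looks like this (use `python -m pip` if you have multiple Python installations):

```shell
# Install pandas from PyPI
pip install pandas
```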
How To Find Data Sets
Finding data sets to practice on is an important step in growing your skills as a data scientist. I recommend using the UCI Machine Learning repository, which is a repository of free, open-source datasets to practice machine learning on.
We will be using the wine-quality dataset from the UCI Machine Learning repository in this tutorial. The URL for this data set is listed below:
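The repository hosts separate files for red and white wine. This tutorial assumes the red-wine file, which at the time of writing lives at the URL below (worth verifying against the repository, since file paths can change):

```python
# Assumed direct link to the red-wine .csv in the UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
```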
How To Import .csv Data Sets
The pandas library includes built-in functions for importing (and saving) data. In the case of the wine-quality dataset, the original file path leads to a .csv file. I will explain how to deal with other file types later on in this article.
We can import .csv files into a Python application using pandas with the read_csv method, which stores the data in the spreadsheet-like DataFrame object.
As an example, here’s how you would import the wine-quality data set using the URL that I introduced earlier:
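A sketch of the call is below. In practice you would pass the dataset's URL straight to read_csv; to keep this example runnable without a network connection, it parses an in-memory sample that mimics the file's semicolon-separated layout (the column names shown are real columns from the wine-quality data):

```python
import io
import pandas as pd

# Real usage would point read_csv at the dataset's URL, e.g.:
#   data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')
# Here we parse a small in-memory sample with the same layout instead.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"chlorides"\n'
    '7.4;0.7;0.076\n'
    '7.8;0.88;0.098\n'
)
data = pd.read_csv(sample, sep=';')

print(data.shape)  # (2, 3): two rows, three columns
```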
Note that the sep=';' argument is required because the values within this .csv file are separated by semicolons rather than the commas that read_csv expects by default.
Now that the .csv file has been imported as a pandas DataFrame, it is very easy to export the file to your local machine using the to_csv method. For this method, the argument is the name of the file that you’re trying to save.
As an example, here is how you would save the DataFrame as a .csv file called wine-quality-data.csv:
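A minimal sketch, using a small stand-in DataFrame in place of the imported wine-quality data:

```python
import pandas as pd

# Stand-in for the wine-quality DataFrame imported earlier
data = pd.DataFrame({"fixed acidity": [7.4, 7.8], "chlorides": [0.076, 0.098]})

# Write the DataFrame to a .csv file in the current directory
data.to_csv('wine-quality-data.csv')
```

Note that to_csv also writes the DataFrame's index as an unnamed first column by default; pass index=False if you do not want it in the file.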
If you look in the directory where you ran this Python script, you should now see the wine-quality-data.csv file!
How To Import .xlsx and .json Data Sets
You will often run into data sets that are not saved as .csv files. Because of this, I wanted to spend a moment explaining how to import files that have the .xlsx and .json file types.
First, let’s start with .xlsx files. These are typically files that were generated and saved using Microsoft Excel. Just like we used the read_csv and to_csv methods to read and write .csv files, we use the following two methods to read and write .xlsx files:
- read_excel
- to_excel
Files with the .json file extension are similar, and use the following two methods:
- read_json
- to_json
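A quick sketch of these pairs in action, using a small made-up DataFrame. The JSON round-trip runs with pandas alone; the Excel pair works the same way but additionally needs an engine such as openpyxl installed:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# JSON round-trip: to_json writes the file, read_json loads it back
df.to_json("sample.json")
restored = pd.read_json("sample.json")
print(restored.equals(df))  # True

# The Excel pair is analogous, but requires an engine such as openpyxl:
#   df.to_excel("sample.xlsx", index=False)
#   pd.read_excel("sample.xlsx")
```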
In the next several sections of this tutorial, you will learn how to manipulate the data imported from the .csv file earlier on.
How To Generate a Pandas Series From A Pandas DataFrame
We’ve already referenced the pandas DataFrame object several times, which is one of the cornerstones of the pandas programming library. Pandas includes another object called a Series which is quite similar to a NumPy array.
Pandas includes some very simple syntax that allows you to easily turn a column from a pandas DataFrame into a standalone pandas Series. You simply pass in the name of the column in square brackets at the end of a DataFrame’s variable name.
As an example, here is how you would create a Series called my_new_series that is equal to the chlorides column of the wine-quality dataset:
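A sketch of the syntax, with a small stand-in DataFrame in place of the imported wine-quality data:

```python
import pandas as pd

# Stand-in for the wine-quality DataFrame; in the tutorial, `data` comes from read_csv
data = pd.DataFrame({"chlorides": [0.076, 0.098, 0.092], "density": [0.9978, 0.9968, 0.997]})

# Square brackets with a column name return that column as a standalone Series
my_new_series = data['chlorides']

print(type(my_new_series))  # <class 'pandas.core.series.Series'>
```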
Now that this new pandas Series has been created, you can manipulate the series using the various pandas Series methods included in the library.
How To Work With Missing Data in Pandas
One of the prime frustrations of data scientists is the problem of missing data. Fortunately, pandas makes it very easy to handle missing data when they are contained in pandas DataFrames.
First, the dropna method can be used to drop every row that contains missing values. Here’s how you would do this with the data DataFrame we created earlier in this lesson:
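A sketch, using a small stand-in DataFrame with one missing value:

```python
import numpy as np
import pandas as pd

# Stand-in DataFrame; NaN marks a missing value
data = pd.DataFrame({"chlorides": [0.076, np.nan, 0.092], "density": [0.9978, 0.9968, 0.997]})

# dropna() returns a copy with every row containing a missing value removed
cleaned = data.dropna()

print(len(cleaned))  # 2: the row with the missing chloride value is gone
```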
If you want to drop columns (instead of rows), you can pass in axis=1 as an argument of the dropna method, like this:
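The same idea applied column-wise, again on a small stand-in DataFrame:

```python
import numpy as np
import pandas as pd

# One column has a missing value, the other is complete
data = pd.DataFrame({"chlorides": [0.076, np.nan], "density": [0.9978, 0.9968]})

# axis=1 drops whole columns that contain any missing value
cleaned = data.dropna(axis=1)

print(list(cleaned.columns))  # ['density']
```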
You can also use the fillna method to replace missing values with a predetermined value. There are two main approaches to using the fillna method.
The first approach is to replace missing values with a static value, like 0. Here’s how you would do this in our data DataFrame:
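A sketch with a stand-in column:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"chlorides": [0.076, np.nan, 0.092]})

# Replace every missing value with the static value 0
filled = data.fillna(0)

print(filled['chlorides'].tolist())  # [0.076, 0.0, 0.092]
```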
The second approach is more complex. It involves replacing missing data with the average value of either:
- The entire DataFrame
- A specific column of the DataFrame
Here’s how you would replace data’s missing values using averages computed across the entire DataFrame (note that each column is filled with its own column’s mean):
If you wanted to replace the missing values in data with the average value of a specific column (say, volatile acidity), then you could do so with the following code:
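A sketch on a stand-in DataFrame, filling just the one column with its own mean:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"volatile acidity": [0.7, np.nan, 0.88], "chlorides": [0.076, np.nan, 0.092]})

# Fill only the volatile acidity column, using that column's own average
data['volatile acidity'] = data['volatile acidity'].fillna(data['volatile acidity'].mean())

# The NaN becomes the column mean, (0.7 + 0.88) / 2 = 0.79
print(round(data['volatile acidity'][1], 2))  # 0.79
```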
Pandas DataFrames Common Operations
Now that you have imported data and dealt with its missing data, let’s discuss a few ways to manipulate data stored in pandas DataFrames.
First, you may want to sort a pandas DataFrame based on its values in a particular column. The sort_values method is ideal for this. For example, we can sort our data DataFrame based on the column density as follows:
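A sketch with a stand-in DataFrame:

```python
import pandas as pd

data = pd.DataFrame({"density": [0.9978, 0.9968, 0.997], "chlorides": [0.076, 0.098, 0.092]})

# Sort the rows by the density column (ascending by default;
# pass ascending=False to reverse the order)
sorted_data = data.sort_values('density')

print(sorted_data['density'].tolist())  # [0.9968, 0.997, 0.9978]
```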
There are also several methods that apply specifically to pandas Series (which are just columns of a DataFrame, remember):
- unique: returns the unique entries in a pandas Series
- nunique: counts the number of unique entries in a pandas Series
- value_counts: counts the number of times each observation occurs in a pandas Series
- apply: allows you to apply an outside function to every element of a pandas Series
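The four methods above, sketched on a small made-up Series:

```python
import pandas as pd

s = pd.Series([7.4, 7.8, 7.4, 6.9])

print(s.unique())                 # distinct values, in order of appearance
print(s.nunique())                # 3
print(s.value_counts()[7.4])      # 7.4 occurs twice: 2
print(s.apply(lambda x: x * 2).tolist())  # every element doubled
```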
Final Thoughts
In this tutorial, you learned how to import and manipulate large datasets in Python using pandas. Please feel free to refer back to this tutorial if you ever get stuck on large datasets in the future.
This article was written by Nick McCullum, who teaches people how to code on his website.