Saturday, March 11, 2023
HomePythonWorking With Messy Knowledge Utilizing Pandas in Python

Working With Messy Knowledge Utilizing Pandas in Python

Hello beautiful folks! 👋 I’m engaged on a mission and needed to wrangle some messy knowledge utilizing Pandas. New mission means new concepts for articles and so right here I’m with an article about the best way to work with messy knowledge utilizing Pandas in Python. I can be masking solely among the fundamentals that you will want to know for many of your knowledge evaluation initiatives. So with out losing any time let’s start!


My dataset is a CSV file containing a seasonal recreation schedule for a selected basketball crew. The primary aim is to course of this knowledge utilizing Pandas, substitute sure information, take away sure rows, and sum sure columns. Briefly, your typical knowledge processing stuff. You will get the enter file from right here. That is what the file appears like:
Input Data

Studying the info right into a Pandas DataFrame

This knowledge is in a CSV file named input_file.csv. We will make use of the read_csv technique of the Pandas library to load it in. Usually you’ll learn a CSV like this:

import pandas as pd
schedule_dataframe = pd.read_csv('input_file.csv')

That is the output once we strive printing the DataFrame:

>>> print(schedule_dataframe)
                                   Tue Nov 5    234  ...    .1 Unnamed: 11
0                                  Sat Nov 9  235.0  ...               NaN
1                                 Wed Nov 13  240.0  ...               NaN
2                                 Sat Nov 16  232.0  ...               NaN
3                                 Solar Nov 24  224.0  ...               NaN
4                                 Wed Nov 27  251.0  ...               NaN
5                                 Fri Nov 29  244.0  ...               NaN
6                                 Sat Nov 30  258.0  ...               NaN
7                                  Wed Dec 4  259.0  ...               NaN
8                                  Sat Dec 7  257.0  ...               NaN
9                                 Wed Dec 11  261.0  ...               NaN
10                                Sat Dec 14  271.0  ...               NaN
11                                Solar Dec 22  284.0  ...               NaN
12                                Sat Dec 28  283.0  ...               NaN
13                                 Fri Jan 3  276.0  ...               NaN
14                                 Thu Jan 9  277.0  ...   0-1         NaN
15                                Sat Jan 11  278.0  ...   0-2         NaN
16                                Wed Jan 15  279.0  ...   0-3         NaN
17                                Sat Jan 18  287.0  ...   1-3         NaN
18                                Wed Jan 22  284.0  ...   2-3         NaN
19                                Sat Jan 25  273.0  ...   3-3         NaN
20                                Thu Jan 30  269.0  ...   3-4         NaN
21                                 Sat Feb 1  267.0  ...   3-5         NaN
22                                 Wed Feb 5  276.0  ...   4-5         NaN
23                                 Sat Feb 8  262.0  ...   4-6         NaN
24                                Sat Feb 15  265.0  ...   5-6         NaN
25                                Thu Feb 20  272.0  ...   5-7         NaN
26                                Thu Feb 27  275.0  ...   5-8         NaN
27                                Sat Feb 29  275.0  ...   5-9         NaN
28                                 Thu Mar 5  274.0  ...  5-10         NaN
29                                 Sat Mar 7  268.0  ...  6-10         NaN
30         Residence video games performed at Titan Fitness center (4    0.0  ...   NaN         NaN
31  Residence courtroom benefit: 2.1 (337th in D-I)    NaN  ...   NaN         NaN

[32 rows x 12 columns]

There are a few vital issues to notice right here. Pandas doesn’t assign a header to this DataFrame. It tries to determine the header mechanically however it makes a judgment name that there isn’t a header within the CSV. In such instances, we will provide a customized header.

We will assign customized column names (header) as a part of the read_csv technique. There are 12 columns in my CSV so I’ll present a listing of 12 names:

schedule_dataframe = pd.read_csv('input_file.csv',
                                names=['date', 'v', 'w', 'competitor_name', 
                                        'result', 'score', 'x', 'y', 'location', 
                                        'record', 'conf_record', 'z'])

The one factor you want to bear in mind is that the names should be distinctive. I don’t care in regards to the u, v, x, y, or z columns. I don’t know precisely what knowledge they comprise so that’s the reason the names aren’t significant.

Now if we attempt to print this DataFrame we will see a header within the output:

Unprocessed Data

Changing knowledge

For my mission, I wanted to take away these rows from the info whose w column contained “NR”. Pandas supplies us a way to drop these rows whose sure columns comprise NaN values. Nonetheless, NR just isn’t equal to NaN so first, we have to substitute each incidence of NR with NaN. We will simply try this through the use of the substitute technique of the DataFrame:

import numpy as np
schedule_dataframe['w'] = schedule_dataframe['w'].substitute('NR', np.nan)

Now we will use the dropna technique of the DataFrame to simply take away the rows whose w column incorporates NaN:

schedule_dataframe = schedule_dataframe.dropna(axis=0, subset=['w'])

The axis=0 tells Pandas to drop the row as an alternative of the column. If we strive printing the DataFrame now, the output will look one thing like this:

Processed Data

Whereas eradicating the rows, Pandas doesn’t replace the index column. I don’t like that so whereas we’re at it, let’s reset the index column as effectively in order that we’ve a steady vary of index values:

schedule_dataframe = schedule_dataframe.reset_index(drop=True)

The output ought to now comprise up to date index values:

Processed Data

Eradicating a row from the DataFrame

I don’t need the z column in my DataFrame because it incorporates solely NaN values so let’s drop that column from my DataFrame:

schedule_dataframe = schedule_dataframe.drop(['z'], axis=1)

Candy! Now the output appears a lot cleaner:

Processed Data

Throughout this step, I went forward and eliminated the final row as effectively. I didn’t want it.

Changing column right into a NumPy array

Now our w column solely incorporates integer values. Let’s say I needed to check that complete sorted column with another checklist. How can we try this? Because it seems, Pandas makes it extremely straightforward for us to do precisely that. We simply want to make use of the to_numpy technique.

Right here is the way you do it:

scores = schedule_dataframe['w'].to_numpy()

# array(['13', '43', '246', '108', '340', '305', '151', '120', '183', '230',
#       '74', '209', '78', '208', '211', '106', '297', '225', '233', '315',
#       '106', '211', '225', '315', '160', '208', '233', '160', '297'],
#      dtype=object)

Now you possibly can kind them and do no matter you need with them.


That is the place this brief tutorial involves an finish. Pandas is admittedly highly effective and this tutorial doesn’t do it justice. Strive it on any knowledge processing or knowledge evaluation duties and let me know the way it goes. There’s a motive it’s an integral a part of any knowledge analysts’ device chest. I hope you realized one thing new as we speak. I’ll see you within the subsequent article. Keep secure and take care! 👋 ❤️



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments