Data Cleaning

Parsing Dates

Step 1: Import the libraries and dataset (Landslides)

The first thing we’ll need to do is load in the libraries and dataset we’ll be using. For today, we’ll be working with a dataset containing information on landslides that occurred between 2007 and 2016.
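The loading code itself is not shown here, so the following is only a minimal sketch of what it might look like. The file name is not given in the text, so the example reads from a small in-memory stand-in; the column names `id`, `date` and `country_name` and the rows themselves are assumptions made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the landslides CSV: the real file name is not
# given in the text, and these rows are invented purely for illustration.
sample_csv = io.StringIO(
    "id,date,country_name\n"
    "34,3/2/07,United States\n"
    "42,3/22/07,Ecuador\n"
    "56,4/6/07,Ecuador\n"
    "59,4/14/07,Pakistan\n"
    "61,4/15/07,Pakistan\n"
)

# With the real file, the argument would be the path to the landslides CSV.
landslides = pd.read_csv(sample_csv)
print(landslides.shape)
```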

Step 2: Check the data type of our date column

For this part of the challenge, I’ll be working with the date column from the landslides data frame. The very first thing I’m going to do is take a peek at the first few rows to make sure it actually looks like it contains dates.
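As a rough sketch of that peek, assuming the data frame is called `landslides` and the column is called `date` (the sample values below are invented so the snippet runs on its own):

```python
import pandas as pd

# Illustrative stand-in for the real data frame; the values are made up.
landslides = pd.DataFrame(
    {
        "date": pd.Series(
            ["3/2/07", "3/22/07", "4/6/07", "4/14/07", "4/15/07", "4/20/07"],
            dtype="object",
        )
    }
)

# head() returns the first five rows; when printed, the last line of the
# output also reports the column's dtype.
first_rows = landslides["date"].head()
print(first_rows)
```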

Yep, those are dates! But just because I, a human, can tell that these are dates doesn’t mean that Python knows that they’re dates. Notice that at the bottom of the output of head(), it says that the data type of this column is “object”.

Pandas uses the “object” dtype for storing various kinds of data, but most often when you see a column with the dtype “object” it will have strings in it.

If you check the pandas dtype documentation, you’ll notice that there’s also a specific datetime64 dtype. Because the dtype of our column is object rather than datetime64, we can tell that Python doesn’t know that this column contains dates.

We can also look at just the dtype of our column without printing the first few rows if we like:

You may have to check the numpy documentation to match the letter code to the dtype of the object. “O” is the code for “object”, so we can see that these two methods give us the same information.
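Here is one way this check might look. The series below is a made-up stand-in for the date column, created with an explicit object dtype so the result does not depend on how a particular pandas version stores strings by default.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for landslides["date"]; the values are invented.
dates = pd.Series(["3/2/07", "3/22/07", "4/6/07"], dtype="object", name="date")

# In a notebook, evaluating dates.dtype displays dtype('O').
print(repr(dates.dtype))

# "O" is numpy's one-letter code for the generic Python-object dtype.
print(dates.dtype.char)
print(dates.dtype == np.dtype("O"))
```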

Step 3: Convert our date columns to datetime

Now that we know that our date column isn’t being recognized as a date, it’s time to convert it so that it is recognized as a date. This is called “parsing dates” because we’re taking in a string and identifying its component parts.

We can tell pandas what the format of our dates is with a guide called a “strftime directive”, which you can find more information on at this link. The basic idea is that you need to point out which parts of the date are where and what punctuation is between them. There are lots of possible parts of a date, but the most common are %d for day, %m for month, %y for a two-digit year and %Y for a four-digit year.

Some examples:

1/17/07 has the format “%m/%d/%y”

17-1-2007 has the format “%d-%m-%Y”
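To see these two directives in action outside of pandas, here is a small sketch using the standard library's `datetime.strptime`, which understands the same format codes. Both strings should come out as the same date.

```python
from datetime import datetime

# Month/day/two-digit year, separated by slashes.
first = datetime.strptime("1/17/07", "%m/%d/%y")

# Day-month-four-digit year, separated by hyphens.
second = datetime.strptime("17-1-2007", "%d-%m-%Y")

print(first.date(), second.date())
```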

Looking back up at the head of the date column in the landslides dataset, we can see that it’s in the format “month/day/two-digit year”, so we can use the same syntax as the first example to parse in our dates:
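A minimal sketch of that parsing step, assuming the data frame is called `landslides`, the raw column is `date`, and the new column is `date_parsed` (the sample rows are invented so the snippet runs on its own):

```python
import pandas as pd

# Illustrative stand-in for the real data frame; the values are made up.
landslides = pd.DataFrame(
    {"date": pd.Series(["3/2/07", "3/22/07", "4/6/07"], dtype="object")}
)

# Parse month/day/two-digit-year strings into a datetime column.
landslides["date_parsed"] = pd.to_datetime(landslides["date"], format="%m/%d/%y")

# The printed dtype is now a datetime64 variant instead of object.
print(landslides["date_parsed"].head())
```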

Now that our dates are parsed correctly, we can interact with them in useful ways.

What if I run into an error with multiple date formats?

While we’re specifying the date format here, sometimes you’ll run into an error when there are multiple date formats in a single column. If that happens, you can have pandas try to infer what the right date format should be. You can do that like so:

landslides['date_parsed'] = pd.to_datetime(landslides['date'], infer_datetime_format=True)

Why don’t you always use infer_datetime_format=True?

There are two big reasons not to always have pandas guess the time format. The first is that pandas won’t always be able to figure out the correct date format, especially if someone has gotten creative with data entry. The second is that it’s much slower than specifying the exact format of the dates.
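If you already know which formats appear in the column, one alternative to guessing is to parse with each explicit format in turn and combine the results. This is a different technique from inference, sketched here on invented values. It also sidesteps the fact that pandas 2.0 and later deprecate the `infer_datetime_format` argument.

```python
import pandas as pd

# Made-up column mixing two known formats.
mixed = pd.Series(["3/2/07", "22-3-2007", "4/6/07"], dtype="object")

# Try each known format in turn; rows that do not match become NaT.
as_slashes = pd.to_datetime(mixed, format="%m/%d/%y", errors="coerce")
as_hyphens = pd.to_datetime(mixed, format="%d-%m-%Y", errors="coerce")

# Keep the first successful parse for each row.
parsed = as_slashes.fillna(as_hyphens)
print(parsed)

# Any rows still NaT here matched none of the formats and need a closer look.
print(parsed.isna().sum())
```

Because every format is spelled out, a string that matches none of them shows up as a missing value instead of being silently misread.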

Select just the day of the month from our column

“This messing around with data types is fine, I guess, but what’s the point?” To answer your question, let’s try to get information on the day of the month that a landslide occurred on from the original “date” column, which has an “object” dtype:

We got an error! The important part to look at here is the part at the very end that says AttributeError: Can only use .dt accessor with datetimelike values. We’re getting this error because the function doesn’t know how to deal with a column with the dtype “object”. Even though our data frame has dates in it, because they haven’t been parsed we can’t interact with them in a useful way.
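The failing call can be reproduced with a small sketch like this one. The data is an invented stand-in for the unparsed column, and the error is caught so the snippet runs to the end.

```python
import pandas as pd

# Illustrative stand-in for the unparsed column; the values are made up.
landslides = pd.DataFrame(
    {"date": pd.Series(["3/2/07", "3/22/07", "4/6/07"], dtype="object")}
)

message = None
try:
    # .dt only works on datetime-like columns, so this raises.
    day_of_month = landslides["date"].dt.day
except AttributeError as err:
    message = str(err)
    print("AttributeError:", message)
```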

Luckily, we have a column that we parsed earlier, and that lets us get the day of the month out no problem:
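A sketch of the working version, again on invented sample rows and assuming the parsed column is named `date_parsed`:

```python
import pandas as pd

# Illustrative stand-in for the real data frame; the values are made up.
landslides = pd.DataFrame(
    {"date": pd.Series(["3/2/07", "3/22/07", "4/6/07"], dtype="object")}
)
landslides["date_parsed"] = pd.to_datetime(landslides["date"], format="%m/%d/%y")

# On a datetime column, .dt.day returns the day of the month for each row.
day_of_month = landslides["date_parsed"].dt.day
print(day_of_month.tolist())  # [2, 22, 6]
```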

Step 4: Plot the day of the month to check the date parsing

One of the biggest dangers in parsing dates is mixing up the months and days. The to_datetime() function does have very helpful error messages, but it doesn’t hurt to double-check that the days of the month we’ve extracted make sense.

To do this, let’s plot a histogram of the days of the month. We expect it to have values between 1 and 31 and since there’s no reason to suppose the landslides are more common on some days of the month than others, a relatively even distribution. (With a dip on 31 because not all months have 31 days.) Let’s see if that’s the case:
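The plotting code is not shown here, so this sketch computes the numbers behind such a histogram and leaves out any particular plotting library. The dates are invented, which means the counts only demonstrate the mechanics and say nothing about the real distribution.

```python
import pandas as pd

# Made-up dates standing in for the parsed landslides column.
dates = pd.Series(
    ["3/2/07", "3/22/07", "4/6/07", "4/14/07", "4/15/07", "5/2/07"],
    dtype="object",
)
day_of_month = pd.to_datetime(dates, format="%m/%d/%y").dt.day

# Sanity check: every extracted day should fall between 1 and 31.
in_range = day_of_month.between(1, 31).all()

# Per-day counts are the bar heights a 31-bin histogram would show.
counts = day_of_month.value_counts().reindex(range(1, 32), fill_value=0)
print(in_range, counts.loc[2])
```

If matplotlib is installed, `day_of_month.plot.hist(bins=31)` should draw the same information as a chart. A swapped month and day would tend to show up as values piling up in the range 1 to 12.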


Posted by News Monkey