Python-100-days-course (Day73 - Advance - Data Visualisation with Matplotlib programming language)

Goals: what you will make by the end of the day

Analyse the Popularity of Different Programming Languages over Time

The oldest programming language still in use today is FORTRAN, which was developed in 1957. Since then many other programming languages have been developed. But which programming language is the most popular? Which programming language is the Kim Kardashian of programming languages; the one people just can't stop talking about?

StackOverflow will help us answer this burning question. Each post on Stack OverFlow comes with a Tag. And this Tag can be the name of a programming language.

To figure out which language is the most popular, all we need to do is count the number of posts on Stack Overflow that are tagged with each language. The language with the most posts wins!

Today you will learn:
- How to visualise your data and create charts with Matplotlib
- How to pivot, group and manipulate your data with Pandas to get it into the format you want
- How to work with timestamps and time-series data
- How to style and customise a line chart to your liking
Without further ado, let's get started!
Download and Open the Starter Notebook

Open the Template

To help you with your Data Science journey I'll provide you with starter notebook which already has some of the sections and challenges laid out. Download the .zip file from this lesson's resources, unzip it, and upload it to Google Drive. There you can open the file as a Colab Notebook.

Import the Data

In this lesson, I've also included a QueryResults.csv file with the Stack Overflow data that we'll be using. Download this .csv and add it to your notebook.

Challenge

For the next steps, let's review the data exploration that we've done yesterday:
- Read the .csv file and store it in a Pandas DataFrame called df. Have a look at the read_csv() documentation and try to provide these column names: ['DATE', 'TAG', 'POSTS']
- Look at the first and last 5 rows of the DataFrame.
- How many rows and how many columns does it have?
- Count the number of entries in each column.
Solution: Analysis by Programming Language

I hope the last steps were fairly straightforward. First, we import pandas and then we can call read_csv(), where we can provide some additional arguments, like the names for our columns.

1. df = pd.read_csv('QueryResults.csv', names=['DATE', 'TAG', 'POSTS'], header=0)

Setting the header row to 0 allows us to substitute our own column names.

Next, we use .head() and .tail() to look at the first and last 5 rows. This allows us to verify that our column naming worked as intended.

To check the dimensions of the DataFrame, we use our old friend .shape. This tells us we have 1991 rows and 3 columns.

To count the number of entries in each column we can use .count(). Note that .count() will actually tell us the number of non-NaN values in each column.

Next Challenge

The TAG is the name of the programming language. So for example in July 2008, there were 3 posts tagged with the language C#. Given that the TAG serves as our category column, can you figure out how to count the number of posts per language? Which programming language had the most number of posts since the creation of Stack Overflow? (Hint: you may need to review one of yesterday's lessons).

Also, some languages are older like C and other languages are newer (like Swift). The dataset starts in July 2008, so some languages will not have any posts for every month. Can you count how many months of posts exist for each programming language?
Data Cleaning: Working with Time Stamps

In order to look at the number of entries and the number of posts by programming language, we need to make use of the .groupby() method. The key is combining .groupby() with the TAG column, which holds as our categories (the names of the programming languages).

If we .sum() the number of posts then we can see how many posts each programming language had since the creation of Stack Overflow.

If we .count() the entries in each column, we can see how many months of entries exist per programming language.
Data Mainpulation: Pivoting DataFrames

Selecting an Individual Cell

Let's take a closer look at the 'DATE' column in our DataFrame. We can use the double square bracket notation to look at the second entry in the column:

1. df['DATE'][1]

Alternatively, for column names no spaces, we can also use the dot-notation:

1. df.DATE[1]

I prefer the square bracket notation for column names since it's more flexible, but with the dot notation, you get to use autocomplete, which is also nice.

Inspecting the Data Type

When we type check the contents of this cell, we see that we are not dealing with a date object, but rather with a string.

This is not very handy. Not only will the string format always show the unnecessary 00:00:00, but we also don't get the benefit of working with Datetime objects, which know how to handle dates and times. Pandas can help us convert the string to a timestamp using the to_datetime() method.

Here's how we can convert the entry in our cell and check that it worked:

Let's use Pandas' to_datetime() to convert the entire df['DATE'] column.

Excellent. Now we can start thinking about how to manipulate our data so that we get a one column per programming language. For all of that and more, I'll see you in the next lesson.
Data Visualisation with Matplotlib

The .pivot() method

Sometimes you want to convert your DataFrame so that each category has its own column. For example, suppose you needed to take the table below and create a separate column for each actor, where each row is the Age of the actor:

How would you do this with the DataFrame below?

1. test_df = pd.DataFrame({'Age': ['Young', 'Young', 'Young', 'Young', 'Old', 'Old', 'Old', 'Old'], 2. 'Actor': ['Jack', 'Arnold', 'Keanu', 'Sylvester', 'Jack', 'Arnold', 'Keanu', 'Sylvester'], 3. 'Power': [100, 80, 25, 50, 99, 75, 5, 30]}) 4. test_df

The easiest way to accomplish this is by using the .pivot() method in Pandas. Try the example for yourself. The thing to understand is how to supply the correct aguments to get the desired outcome. The index are the categories for the rows. The columns are the categories for the columns. And the values are what you want in the new cells.

1. pivoted_df = test_df.pivot(index='Age', columns='Actor', values='Power') 2. pivoted_df

However, there's one very important thing to notice. What happens if a value is missing? In the example above there's no value for old Sylvester. In this case, the .pivot() method will insert a NaN value.

Mini-Challenge
- Can you pivot the df DataFrame so that each row is a date and each column is a programming language? Store the result under a variable called reshaped_df.
- Examine the dimensions of the reshaped DataFrame. How many rows does it have? How many columns?
- Examine the head and the tail of the DataFrame. What does it look like?
- Print out the column names.
- Count the number of entries per column.
You should get something like this:

Solution

Here's how you pivot our existing DataFrame to get the outcome above:

1. reshaped_df = df.pivot(index='DATE', columns='TAG', values='POSTS')

We have 145 rows and 14 columns in the new DataFrame. Each programming language became a column and our date column became the new index (i.e., the label for the rows).

When we count the number of entries per column we see that not all languages are the same. The reason is that the .count() method excludes NaN values. When we pivoted the DataFrame the NaN values were inserted when there were no posts for a language in that month (e.g., Swift in July, 2008).

Dealing with NaN Values

In this case, we don't want to drop the rows that have a NaN value. Instead, we want to substitute the number 0 for each NaN value in the DataFrame. We can do this with the .fillna() method.

1. reshaped_df.fillna(0, inplace=True)

The inplace argument means that we are updating reshaped_df. Without this argument we would have to write something like this:

1. reshaped_df = reshaped_df.fillna(0)

Let's check if we successfully replaced all the NaN values in our DataFrame.

We can also check if there are any NaN values left in the entire DataFrame with this line:

1. reshaped_df.isna().values.any()

Here we are using the .isna() method that we've used before, but we're chaining two more things: the values attribute and the any() method. This means we don't have to search through the entire DataFrame to spot if .isna() is True.

Now we're all set to create some charts and visualise our data. For all of that and more, I'll see you in the next lesson!
Multi-Line Charts with Matplotlib
Smoothing out Time-Series Data
Programming Language Data Analysis
Learning Points & Summary

Challenge

Next Challenge