Edit me

Pandas Basics

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. By far, the most used python library to handle dataframes.

Scope of Data Science: clean, analyze your data and build models with it. Final objective is to **derive insight **and retrieve useful information.

Personally speaking, if you will ever face a large dataset (>1.000.000 records) I advise you to use Modin that will speed up your operations.

# !pip install pandas
import pandas as pd
# print(pd.__version__)

#Explicit definition of a dataframe

mydataset = {
  'cars': ["BMW", "Volvo", "Ford", "Tesla"],
  'Country': ["Italy", "Spain", "Italy", "Italy"],
  'sales': [3, 7, 2, 1]
}

df_cars = pd.DataFrame(mydataset)

df_cars.head(3)
cars Country sales
0 BMW Italy 3
1 Volvo Spain 7
2 Ford Italy 2
# Alternatively, a dataframe can also be created from lists


auto = ["BMW", "Volvo", "Ford", "Tesla"]
sales =  [3, 7, 2, 1]
country = ["Italy", "Spain", "Italy", "Italy"]
  
# Calling DataFrame constructor after zipping
# both lists, with columns specified
df = pd.DataFrame(list(zip(auto, country, sales)),
               columns =['Auto', "Nazione", "Vendite"])
df
Auto Nazione Vendite
0 BMW Italy 3
1 Volvo Spain 7
2 Ford Italy 2
3 Tesla Italy 1
#Load df from askdata, explicit way

username = "geyos65958@ergowiki.com"
password = "Password"

!pip install askdata
from askdata import Agent, Askdata


askdata = Askdata(username = username, password = password)
agent = askdata.agent("red_wine")
df = agent.load_dataset("red_wine")
# red
# red.head(5)
# red.info()
#red.columns()
# red.dtypes

Create a new column or delete an existing one

df["total acidity"] = df["fixed acidity"] + df["volatile acidity"]
# del df["total acidity"]
df.drop(columns = ["total acidity"]).head(3)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5

Exercises module 3

a) create a dataframe with the method of your choice

Learn how to use Pandas data structures in data science to manipulate data (group by, pivot, melt)

Pandas offers a wide variety of different method to transform and manipulate your data. I will provide examples of some functions. For further details please check the Pandas Official Documentation

Sorting

df.sort_values(['quality'], ascending=False)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality total acidity
495 10.7 0.350 0.53 2.60 0.070 5.0 16.0 0.99720 3.15 0.65 11.0 8 11.050
1403 7.2 0.330 0.33 1.70 0.061 3.0 13.0 0.99600 3.23 1.10 10.0 8 7.530
390 5.6 0.850 0.05 1.40 0.045 12.0 88.0 0.99240 3.56 0.82 12.9 8 6.450
1061 9.1 0.400 0.50 1.80 0.071 7.0 16.0 0.99462 3.21 0.69 12.5 8 9.500
1202 8.6 0.420 0.39 1.80 0.068 6.0 12.0 0.99516 3.35 0.69 11.7 8 9.020
... ... ... ... ... ... ... ... ... ... ... ... ... ...
690 7.4 1.185 0.00 4.25 0.097 5.0 14.0 0.99660 3.63 0.54 10.7 3 8.585
1478 7.1 0.875 0.05 5.70 0.082 3.0 14.0 0.99808 3.40 0.52 10.2 3 7.975
899 8.3 1.020 0.02 3.40 0.084 6.0 11.0 0.99892 3.48 0.49 11.0 3 9.320
1299 7.6 1.580 0.00 2.10 0.137 5.0 9.0 0.99476 3.50 0.40 10.9 3 9.180
832 10.4 0.440 0.42 1.50 0.145 34.0 48.0 0.99832 3.38 0.86 9.9 3 10.840

1599 rows × 13 columns

Select cells based on condition

df[df["quality"]==3].head(3)

#This does not affect the original df, if you want to create a new copy just assign it to a new variable
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
251 8.5 0.26 0.21 16.2 0.074 41.0 197.0 0.9980 3.02 0.50 9.8 3
253 5.8 0.24 0.44 3.5 0.029 5.0 109.0 0.9913 3.53 0.43 11.7 3
294 9.1 0.59 0.38 1.6 0.066 34.0 182.0 0.9968 3.23 0.38 8.5 3

Slicing

  • loc gets rows (and/or columns) with particular labels.

  • iloc gets rows (and/or columns) at integer locations.

# df["quality"]
# df.iloc[1]
# df.iloc[1004,3]
# df.loc[1,"fixed acidity"]

EXERCISES

a) Check if the cell in position 1004, column residual sugar and print “Yes” if column residual sugar (3) is greater than 2.0

b) Check if the cell in position 1004, column quality. Print “Yes” if the quality is equal to the one of row 1400

# if df.iloc[1004,3] > 2.0:
#   print(True)

# if df.loc[1004,"residual sugar"] != 0:
#   print(True)

# if df.loc[1004,"quality"] == df.loc[1400,"quality"]:
#   print("Yes")

Filtering

df.loc[(df['quality'] == 5) & (df['volatile acidity'] == 0.700) & (df['residual sugar'] > 1.9)]
# df.loc[(df['quality'] == 5) & (df['volatile acidity'] == 0.700) & (df['residual sugar'] > 1.9)].reset_index(drop=True)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
360 8.2 0.7 0.23 2.00 0.099 14.0 81.0 0.99730 3.19 0.70 9.4 5
1471 6.7 0.7 0.08 3.75 0.067 8.0 16.0 0.99334 3.43 0.52 12.6 5
1519 6.6 0.7 0.08 2.60 0.106 14.0 27.0 0.99665 3.44 0.58 10.2 5

GroupBy

df.groupby(['quality']).count() #mean, sum ....
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
quality
3 10 10 10 10 10 10 10 10 10 10 10
4 53 53 53 53 53 53 53 53 53 53 53
5 681 681 681 681 681 681 681 681 681 681 681
6 638 638 638 638 638 638 638 638 638 638 638
7 199 199 199 199 199 199 199 199 199 199 199
8 18 18 18 18 18 18 18 18 18 18 18

EXERCISES

a) Calculate the average citric acid for red wine having “quality = 3”

b) Calculate the number of observations for every quality class

Unique

names = df["quality"].unique()
names
array([6, 5, 7, 8, 4, 3, 9])

EXERCISES

a) How many different quality classes are represented in the df? Check if we have a quality == 10

Pivot

The pivot() function is used to reshaped a given DataFrame organized by given index / column values. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns.

df_cars
cars Country sales
0 BMW Italy 3
1 Volvo Spain 7
2 Ford Italy 2
3 Tesla Italy 1
df_cars.pivot(index='cars', columns='Country', values='sales')
Country Italy Spain
cars
BMW 3.0 NaN
Ford 2.0 NaN
Tesla 1.0 NaN
Volvo NaN 7.0

Understand techniques for accessing and using files

Different functions to deal with different files.

pd.read_excel('data.xlsx', index_col = 0)
df.to_excel("output.xlsx", index = False)
pd.read_csv()
df.to_csv("output.csv")
# .to_json()
# .to_html()
# .to_sql()
# .to_pickle() ...
# When dealing with huge amount of data better work with chunks

chunk_list = []  # append each chunk df here 

# Each chunk is in df format
for chunk in df_chunk:  
    # perform data filtering 
    chunk_filter = chunk_preprocessing(chunk)
    
    # Once the data filtering is done, append the chunk to list
    chunk_list.append(chunk_filter)
    
# concat the list into dataframe 
df_concat = pd.concat(chunk_list)

Final exercise

For this final exercise, you are going to use the (in)famous titanic dataset. Here is a brief description:

Variable Definition Key
survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex Sex  
Age Age in years  
sibsp # of siblings / spouses aboard the Titanic  
parch # of parents / children aboard the Titanic  
ticket Ticket number  
fare Passenger fare  
cabin Cabin number  
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

Tasks

a) Load the dataset “titanic” (via Askdata, username = “fosic56191@ppp998.com” , password = “Luiss1234!”)

b) Explore the dataset and get a basic understanding of it (How many columns are numerical, categorical, ecc..)

c) How many passengers survived? What was the average age of the survivors?

d) Which port of Embarkation had the greatest rate of survivors? (hint: help(df.groupby))

e) How many passengers were below 22 years old? How old was the youngest passenger ( hint help(df.min())

f) write a for loop to append values to a brand new list called survivors. If the i-th record survived append “Survived” to the list, else “Not Survived”

g) Create a subset of the dataframe taking only the survivors (Survived == 1) and export it as “survivors.csv”

List comprehensions

Find all of the numbers from 1–1000 that have a 6 in them

Find all of the words in a string that are less than 5 letters

# A
from askdata import Askdata

askdata = Askdata(username = "fosic56191@ppp998.com", password = "Luiss1234!")
agent_name = "titanic"


# C
# df.groupby("Survived").count()
# df[df["Survived"] ==1]["Age"].mean()

# D
# df.groupby(["Embarked", "Survived"]).count()



# E
# df[df["Age"]< 22] 800
# min(df["Age"]) 0.42


#F
# survived = ["Survived" for x in range(len(df)) if df.loc[x,"Survived"]==1]


#G
# temp = df[df["Survived"]==1]
# temp.to_csv("df.csv")

#LIST 1
# q2_answer = [num for num in nums if "6" in str(num)]

#LIST 2
# sentence = ["I love radio rock"]
# words = string.split(" ")
# q5_answer = [word for word in words if len(word) < 5]

Colab