
Data Cleaning

It is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

We’ll cover the following:

  • Dropping unnecessary columns in a DataFrame
  • Changing the index of a DataFrame
  • Detecting and removing missing values (NA)
  • Renaming columns to a more recognizable set of labels
# Load df from askdata, explicit way

username = "geyos65958@ergowiki.com"
password = "Password"

!pip install askdata

from askdata import Agent, Askdata


askdata = Askdata(username = username, password = password)
agent = askdata.agent("red_wine")
df = agent.load_dataset("red_wine")

# Column labels are case-sensitive: the column is named 'quality'
to_drop = ['quality']

df.drop(to_drop, axis=1).head(3)
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8
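
To try column dropping without the askdata account, here is a minimal sketch on a tiny made-up DataFrame. The `toy` name and its values are assumptions for illustration only, not the wine data. It shows that `drop` returns a new DataFrame and that a misspelled label can be tolerated with `errors="ignore"`.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine dataset
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.30],
    "quality": [5, 5, 7],
})

# drop returns a new DataFrame; toy itself is left unchanged
without_quality = toy.drop(columns=["quality"])

# errors="ignore" skips labels that do not exist instead of raising KeyError
safe = toy.drop(columns=["Quality"], errors="ignore")

print(list(without_quality.columns))
print(list(safe.columns))
```

Because the result is a new object, assign it back (for example `toy = toy.drop(columns=["quality"])`) if the change should persist.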
## Indexing

# Unlike a primary key in SQL, a pandas Index is not guaranteed to be unique,
# although many indexing and merging operations run faster when it is.
# A meaningful index also makes label-based selection and slicing easier.

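# Often you need to take into account the index of a given row. For example,
# filtering keeps the original index labels, so the result can have gaps in its index.
# Call reset_index(drop=True) to renumber the rows from 0:

df1 = df[df["quality"]!=5]
# df1 = df[df["quality"]!=5].reset_index(drop=True)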

df1
     fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
  0           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0  0.99800  3.16       0.58      9.8        6
  1            7.3              0.65         0.00             1.2      0.065                 15.0                  21.0  0.99460  3.39       0.47     10.0        7
  2            7.8              0.58         0.02             2.0      0.073                  9.0                  18.0  0.99680  3.36       0.57      9.5        7
  3            8.5              0.28         0.56             1.8      0.092                 35.0                 103.0  0.99690  3.30       0.75     10.5        7
  4            7.4              0.59         0.08             4.4      0.086                  6.0                  29.0  0.99740  3.38       0.50      9.0        4
...            ...               ...          ...             ...        ...                  ...                   ...      ...   ...        ...      ...      ...
913            6.3              0.51         0.13             2.3      0.076                 29.0                  40.0  0.99574  3.42       0.75     11.0        6
914            6.8              0.62         0.08             1.9      0.068                 28.0                  38.0  0.99651  3.42       0.82      9.5        6
915            5.9              0.55         0.10             2.2      0.062                 39.0                  51.0  0.99512  3.52       0.76     11.2        6
916            6.3              0.51         0.13             2.3      0.076                 29.0                  40.0  0.99574  3.42       0.75     11.0        6
917            6.0              0.31         0.47             3.6      0.067                 18.0                  42.0  0.99549  3.39       0.66     11.0        6

918 rows × 12 columns
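
The effect of filtering on the index is easier to see on a small example. The sketch below uses a made-up four-row DataFrame (the `toy` name and values are assumptions, not the wine data) to show that a boolean filter keeps the original row labels and that `reset_index(drop=True)` renumbers them.

```python
import pandas as pd

# Hypothetical toy data with the default 0..3 index
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 10.5, 11.0], "quality": [5, 6, 5, 7]})

# Filtering keeps the labels of the surviving rows
filtered = toy[toy["quality"] != 5]
print(list(filtered.index))    # [1, 3]

# reset_index(drop=True) discards the old labels and renumbers from 0
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]

# Label-based access now works with 0; on `filtered`, .loc[0] would raise KeyError
print(renumbered.loc[0, "quality"])  # 6
```

Position-based access with `.iloc[0]` works on both versions, since it ignores the labels.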

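# NAs (missing values) usually get in the way when handling data, so you need to identify them and, where appropriate, remove them

import pandas as pd

df = pd.read_csv("wine-red.csv", sep=";")
df.isna().sum()  # number of missing values in each column
# Append a row with a missing value (None) in the second-to-last column
df.loc[len(df)] = ["32", "2", "2", "32", "1", "1", "1", "1", "1", "2", None, "4"]
# df = df.dropna()  # drops every row that contains at least one NA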
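
For a version that runs without the CSV file, here is a minimal sketch on a made-up DataFrame (the `toy` name and values are assumptions for illustration). It counts the missing values per column and then compares dropping any incomplete row with dropping only rows that lack a specific column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.5],
    "pH": [3.51, 3.20, np.nan],
    "quality": [5, 6, 7],
})

# Number of missing values in each column
na_counts = toy.isna().sum()
print(na_counts)

# Keep only rows with no missing value at all
complete_rows = toy.dropna()

# Keep rows as long as 'alcohol' is present, even if other columns are missing
alcohol_known = toy.dropna(subset=["alcohol"])

print(len(complete_rows), len(alcohol_known))
```

Using `subset` is often gentler than a bare `dropna()`, which discards a row for a gap in any column.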
# If you need to rename a column

df.rename(columns = {'quality':'Wine quality'}, inplace = True)
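
As a small illustration of how `rename` behaves, the sketch below uses a made-up two-column DataFrame (the `toy` name and values are assumptions). Without `inplace=True` the call returns a renamed copy, and by default a key that matches no column is silently ignored.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"pH": [3.51, 3.20], "quality": [5, 6]})

# Returns a new DataFrame; toy keeps its original labels
renamed = toy.rename(columns={"quality": "Wine quality"})
print(list(renamed.columns))  # ['pH', 'Wine quality']
print(list(toy.columns))      # ['pH', 'quality']

# A misspelled key does nothing and raises no error by default
unchanged = toy.rename(columns={"Quality": "Wine quality"})
print(list(unchanged.columns))  # ['pH', 'quality']
```

If a rename seems to have no effect, checking the exact spelling and case of the column label is a good first step.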

Detect Outliers

Outliers deserve a special mention in this section. In statistics, an outlier is an observation that lies far from the other observations. We do not cover outlier detection here because it requires some theoretical background, but keep in mind that it is an important step when analyzing data.

Lambda functions

A lambda function works like a normal Python function, except that it is defined without a name and its body is a single expression, usually written on one line. You pass the function a value (the argument) and it returns the result of evaluating the expression. The keyword lambda comes first, and a colon (:) separates the argument from the expression.

# Normal Python function
def a_name(x):
    return x+x

# Lambda function
lambda x: x+x

Pros

  • Good for simple logical operations that are easy to understand. This makes the code more readable too.
  • Good when you want a function that you will use just one time.

Cons

  • They can contain only a single expression. It’s not possible to have multiple independent statements in one lambda function.
  • Bad for operations that would span more than one line in a normal def function (for example, nested conditional operations). If you need a minute or two to understand the code, use a named function instead.
  • Bad because you can’t write a docstring to explain the inputs, operations, and outputs as you would in a normal def function.
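
To make the trade-off concrete, the sketch below writes the same nested conditional twice: once squeezed into a lambda and once as a named function with a docstring. The labels and thresholds are made up for illustration.

```python
# A nested conditional in a lambda: legal, but it takes a moment to parse
label_lambda = lambda q: "low" if q < 5 else ("medium" if q < 7 else "high")


def label_quality(q):
    """Return 'low' for q < 5, 'medium' for 5 <= q < 7, otherwise 'high'."""
    if q < 5:
        return "low"
    if q < 7:
        return "medium"
    return "high"


scores = [4, 5, 7]
print([label_lambda(q) for q in scores])   # ['low', 'medium', 'high']
print([label_quality(q) for q in scores])  # ['low', 'medium', 'high']
```

Both versions return the same labels, but only the named function documents its thresholds and shows up by name in a traceback.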
(lambda x: x*2)(25)
50
## Lambda function with filter
# filter is a Python built-in function that returns only those values that meet a given condition.

list_1 = [1,2,3,4,5,6,7,8,9]
list(filter(lambda x: x%2==0, list_1))
[2, 4, 6, 8]
## Lambda function with map
# map applies a function to every value in the original list and returns an iterator over the results; the original list is left unchanged.

list_1 = [1,2,3,4,5,6,7,8,9]
cubed = map(lambda x: pow(x,3), list_1)
list(cubed)
[1, 8, 27, 64, 125, 216, 343, 512, 729]
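
One detail worth knowing: in Python 3, `map` returns a lazy iterator rather than a list, which is why the example wraps it in `list(...)`. The short sketch below (variable names are illustrative) shows that the iterator can be consumed only once, and that `map` can also take several iterables.

```python
list_1 = [1, 2, 3]
doubled = map(lambda x: x * 2, list_1)

first_pass = list(doubled)   # [2, 4, 6]
second_pass = list(doubled)  # [] because the iterator is already exhausted

# With several iterables, the lambda receives one item from each per call
sums = list(map(lambda x, y: x + y, [1, 2, 3], [10, 20, 30]))

print(first_pass, second_pass, sums)
```

If the results are needed more than once, convert them to a list right away and reuse that list.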

Exercises

a) Write a Python program that creates a lambda function that adds 15 to a number passed in as an argument. Also create a lambda function that multiplies argument x by argument y, and print the results.

b) Given the list nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], use filter with a lambda to keep only the even integers.

# r = lambda a : a + 15
# print(r(10))
# r = lambda x, y : x * y
# print(r(12, 4))

# nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# print("Original list of integers:")
# print(nums)
# print("\nEven numbers from the said list:")
# even_nums = list(filter(lambda x: x%2 == 0, nums))
# print(even_nums)
