Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends and correlations that might not otherwise be detected can be exposed. Python offers multiple great graphing libraries that come packed with lots of different features.
Matplotlib is one of the most widely used: it is a comprehensive library for creating static, animated, and interactive visualizations in Python.
## Install Askdata via PyPI

    !pip install askdata

    # Load modules
    from askdata import Agent, Askdata

    # Authenticate to Askdata (the system will ask for e-mail and password)
    askdata = Askdata()

    # Load dataset
    agent = askdata.agent("red_wine")
    df = agent.load_dataset("red_wine")
    df
To create a scatter plot in Matplotlib we can use the `scatter` method. We will also create a figure and an axis using `plt.subplots` so we can give our plot a title and labels.

    import matplotlib.pyplot as plt

    # create a figure and axis
    fig, ax = plt.subplots()

    # plot quality on the x-axis against pH on the y-axis
    ax.scatter(df['Quality'], df['Ph'])

    # set a title and labels
    ax.set_title('Red wine Dataset')
    ax.set_xlabel('Quality')
    ax.set_ylabel('pH')

    # display the figure
    plt.show()
Data visualization is often crucial when you want to convey a message to someone. For this reason, investing your time in it is always recommended. When I need fancy graphs, I always visit this website, which offers a huge variety of graphs with ready-to-use code snippets.
### Bar plot

    # create a figure and axis
    fig, ax = plt.subplots()

    # count the occurrence of each class
    data = df['Quality'].value_counts()

    # get x and y data
    points = data.index
    frequency = data.values

    # create bar chart
    ax.bar(points, frequency)

    # set title and labels
    ax.set_title('Wine Review Quality')
    ax.set_xlabel('Points')
    ax.set_ylabel('Frequency')

    # display the figure
    plt.show()
Correlation is a statistic that measures the degree to which two variables move in relation to each other; in the finance and investment industries, for example, it describes how two securities move together. Its value always falls between -1.0 and +1.0. Since correlation matters a great deal for ML algorithms, we can analyze it:

    # keep only the numeric columns, since correlation needs numeric data
    numeric = df.select_dtypes(['number'])

    f = plt.figure(figsize=(19, 15))
    plt.matshow(numeric.corr(), fignum=f.number)

    # label both axes with the column names (shape[1] is the number of columns)
    plt.xticks(range(numeric.shape[1]), numeric.columns, fontsize=14, rotation=45)
    plt.yticks(range(numeric.shape[1]), numeric.columns, fontsize=14)

    cb = plt.colorbar()
    cb.ax.tick_params(labelsize=14)
    plt.title('Correlation Matrix', fontsize=16);
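To make the number behind each cell of the matrix more concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python (pandas' `DataFrame.corr()` uses the Pearson method by default). The function name `pearson` and the toy lists are assumptions made up for illustration; they are not part of the wine dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy values, made up for illustration only
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]  # b = 2 * a, a perfect increasing line
c = [5.0, 4.0, 3.0, 2.0, 1.0]   # a reversed, a perfect decreasing line

r_ab = pearson(a, b)
r_ac = pearson(a, c)
print(r_ab, r_ac)
```

The two toy pairs sit at the extremes of the range: a perfectly increasing linear relationship gives +1, a perfectly decreasing one gives -1, and real columns fall somewhere in between.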
Why is correlation important?
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.
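One simple way to spot candidates for multicollinearity is to list the pairs of columns whose absolute correlation exceeds some cutoff. The sketch below is only an illustration under stated assumptions: the helper `highly_correlated_pairs` is hypothetical, the threshold of 0.8 is an arbitrary choice rather than a standard, and the `demo` frame is synthetic stand-in data, not the wine dataset. A pairwise check like this can also miss relationships that involve three or more variables at once.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(["number"]).corr()
    cols = corr.columns
    pairs = []
    # visit each unordered pair once (upper triangle, diagonal excluded)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic stand-in data (assumption: not the real wine dataset)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of x1
    "x3": rng.normal(size=200),                       # generated independently of x1
})

pairs = highly_correlated_pairs(demo, threshold=0.8)
print(pairs)
```

With the dataset loaded earlier, the same helper could be called as `highly_correlated_pairs(df)`; any pair it reports is a candidate for dropping or combining one of the two columns before fitting a regression model.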