Saturday, February 4, 2017

Statistics : Computing Mean,Variance, Percentiles, Correlations in Python



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.basemap import Basemap
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'
In [2]:
iris = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
In [6]:
iris.head()
Out[6]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Computing Mean

In [5]:
mean_petal = np.mean(iris['petal_length'])
print("Mean", mean_petal)
Mean 3.7586666666666693

Computing Percentiles

In [7]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""

    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y
In [8]:
x,y = ecdf(iris["petal_length"])
plt.plot(x,y, marker = '.', linestyle = 'none')
plt.margins(0.02)
plt.xlabel('petal length (cm)')
plt.ylabel("ECDF")
Out[8]:
<matplotlib.text.Text at 0x11ce63630>
In [11]:
percentiles = np.array([2.5,5,10,20,50,75,90])
val = np.percentile(iris["petal_length"], percentiles)
print(val)
[ 1.2725  1.3     1.4     1.5     4.35    5.1     5.8   ]
In [15]:
x,y = ecdf(iris["petal_length"])
plt.plot(x,y, marker = '.', linestyle = 'none')
plt.plot(val, percentiles/100, marker ='D', color = 'red', linestyle='none')
plt.margins(0.02)
plt.xlabel('petal length (cm)')
plt.ylabel("ECDF")
Out[15]:
<matplotlib.text.Text at 0x11da137b8>

Box plot

In [17]:
sns.boxplot("species", "petal_length", data = iris)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aa665c0>

Computing variance

In [32]:
total_var = np.var(iris["petal_length"])
print(total_var)
pd.pivot_table(iris, index="species",values="petal_length", aggfunc='var' )
3.0924248888888854
Out[32]:
species
setosa        0.030106
versicolor    0.220816
virginica     0.304588
Name: petal_length, dtype: float64

Computing standard deviation

In [33]:
total_std = np.std(iris["petal_length"])
print(total_std)
pd.pivot_table(iris, index="species",values="petal_length", aggfunc='std' )
1.7585291834055201
Out[33]:
species
setosa        0.173511
versicolor    0.469911
virginica     0.551895
Name: petal_length, dtype: float64

Understanding spread distributions

Let's consider there are two data sets "a" and "b"
In [20]:
a = np.array([-10,0,10,20,30])
b = np.array([8,9,10,11,12])
Computing mean for "a" and "b"
In [21]:
mean_a = np.mean(a)
mean_b = np.mean(b)
In [22]:
mean_a
Out[22]:
10.0
In [23]:
mean_b
Out[23]:
10.0
Here, the mean is same for both the data set, but if we examine closely, it's clearly seen that the dataset "b" is arranged closely and the dataset "a" is more dispersed
Range of the dataset
Range is the difference between the smallest value and the largest value in the dataset
In [27]:
range_a = np.max(a) - np.min(a)
range_b = np.max(b) - np.min(b)
In [28]:
range_a
Out[28]:
40
In [29]:
range_b
Out[29]:
4
Variance of the dataset
Variance is average of sum of the squared distances from all the datapoint to mean of the dataset
In [30]:
var_a = np.var(a)
var_b = np.var(b)
In [31]:
print(var_a)
print(var_b)
200.0
2.0
By comparing the variance of both the dataset, we can confirm that the dataset "b" is less dispersed.
Standard deviation of the dataset
Standard deviation is the square root of the variance
In [34]:
std_a = np.std(a)
std_b = np.std(b)
In [35]:
print(std_a)
print(std_b)
14.1421356237
1.41421356237

Correlation between two variables

In [37]:
plt.scatter(iris["petal_length"], iris["petal_width"])
Out[37]:
<matplotlib.collections.PathCollection at 0x11ea5a9b0>
Computing the covariance matrix
In [39]:
cov = np.cov(iris["petal_length"], iris["petal_width"])
print(cov)
[[ 3.11317942  1.29638747]
 [ 1.29638747  0.58241432]]
Here, [0,0] is the variance of data x, and [1,1] is the variance of data y.
[0,1] and [1,0] are the covariance of the above two dataset
In [45]:
sns.heatmap(cov, annot=True)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x11f0abe80>

Computing the pearson correlation coefficient

In [48]:
corr = np.corrcoef(iris["petal_length"], iris["petal_width"])
print(corr)
[[ 1.         0.9627571]
 [ 0.9627571  1.       ]]
In [49]:
sns.heatmap(corr)
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x11f4c7a20>
Another way
In [42]:
corr=iris[['petal_length','petal_width']].corr()
print(corr)
              petal_length  petal_width
petal_length      1.000000     0.962757
petal_width       0.962757     1.000000
In [44]:
sns.heatmap(corr, square=True,annot=True)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ed5bb00>

No comments :

Post a Comment