1. Import libraries¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'
2. Import data¶
In [2]:
data = pd.read_csv("spam.csv", encoding='latin-1')
In [3]:
data.head()
Out[3]:
Let's drop the unwanted columns and rename the remaining ones appropriately.
In [4]:
#Drop column and name change
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"text"})
In [5]:
data.tail()
Out[5]:
In [6]:
#Count observations in each label
data.label.value_counts()
Out[6]:
In [7]:
# convert label to a numerical variable
data['label_num'] = data.label.map({'ham':0, 'spam':1})
In [8]:
data.head()
Out[8]:
3. Train Test Split¶
Before performing text transformation, let us do a train test split. In fact, we could perform k-Fold cross-validation instead; for simplicity, I am doing a plain train test split here (a cross-validation sketch follows the split below).
In [9]:
from sklearn.model_selection import train_test_split
In [10]:
X_train, X_test, y_train, y_test = train_test_split(data["text"], data["label"], test_size=0.2, random_state=10)
In [11]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
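For reference, here is a minimal k-Fold cross-validation sketch. The pipeline and classifier (CountVectorizer + MultinomialNB) are assumptions for illustration only; this notebook has not chosen a model yet.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Hypothetical pipeline for illustration; vectorizing inside the pipeline
# ensures each fold learns its vocabulary from its own training split
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, data["text"], data["label_num"], cv=5)
print(scores.mean())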
4. Text Transformation¶
Various text transformation techniques such as stop word removal, lowercasing, tf-idf weighting, pruning, and stemming can be performed using the sklearn.feature_extraction module, after which the data can be converted into a bag-of-words representation. For this problem, let us see how our model performs without removing stop words; the sketch below shows how those options could be turned on.
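A sketch of how those preprocessing options map onto CountVectorizer parameters; the specific settings here are illustrative, not what this notebook uses:
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative settings only: lowercase the text, drop English stop words,
# and prune terms appearing in fewer than 2 or more than 95% of documents
vect_alt = CountVectorizer(lowercase=True, stop_words='english',
                           min_df=2, max_df=0.95)
Stemming is not built into CountVectorizer; it would need a custom tokenizer (for example, one wrapping NLTK's PorterStemmer).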
In [12]:
from sklearn.feature_extraction.text import CountVectorizer
In [13]:
vect = CountVectorizer()
Note: We can also perform a tf-idf transformation.
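A minimal sketch of that alternative with TfidfVectorizer (not run here):
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)  # learns vocabulary and IDF weights
X_test_tfidf = tfidf.transform(X_test)        # reuses the training vocabulary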
In [14]:
vect.fit(X_train)
Out[14]:
The vect.fit function learns the vocabulary. We can get all the feature names from vect.get_feature_names() (get_feature_names_out() in newer scikit-learn versions).
Let us print the first and last twenty features.
In [15]:
print(vect.get_feature_names()[0:20])
print(vect.get_feature_names()[-20:])
In [16]:
X_train_df = vect.transform(X_train)
Now, let's transform Test data.
In [17]:
X_test_df = vect.transform(X_test)
In [18]:
type(X_test_df)
Out[18]:
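The transformed data is a SciPy sparse matrix, which stores only the non-zero counts. A quick, illustrative way to inspect it:
print(X_test_df.shape)          # (number of documents, vocabulary size)
print(X_test_df.nnz)            # count of stored non-zero entries
print(X_test_df[:3].toarray())  # densify a small slice only; the full matrix would be large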
5. Visualisations¶
In [19]:
ham_words = ''
spam_words = ''
spam = data[data.label_num == 1]
ham = data[data.label_num == 0]
In [20]:
import nltk
from nltk.corpus import stopwords
In [21]:
for val in spam.text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    #tokens = [word for word in tokens if word not in stopwords.words('english')]
    for words in tokens:
        spam_words = spam_words + words + ' '

for val in ham.text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        ham_words = ham_words + words + ' '
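Building strings by repeated concatenation is slow for large corpora; an equivalent sketch with ' '.join that produces the same word strings:
import nltk  # already imported above

spam_words = ' '.join(word for val in spam.text
                      for word in nltk.word_tokenize(val.lower()))
ham_words = ' '.join(word for val in ham.text
                     for word in nltk.word_tokenize(val.lower()))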
In [22]:
from wordcloud import WordCloud
In [23]:
# Generate a word cloud image
spam_wordcloud = WordCloud(width=600, height=400).generate(spam_words)
ham_wordcloud = WordCloud(width=600, height=400).generate(ham_words)
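Since the stop word filter above is commented out, common words will dominate both clouds. WordCloud can drop them itself via its stopwords parameter; an illustrative alternative, not what is plotted below:
from wordcloud import WordCloud, STOPWORDS

# Hypothetical variant that filters common English words inside WordCloud
spam_wordcloud_filtered = WordCloud(width=600, height=400,
                                    stopwords=STOPWORDS).generate(spam_words)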
In [24]:
#Spam Word cloud
plt.figure(figsize=(10, 8), facecolor='k')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()