Tuesday, April 11, 2017

Text Preprocessing and Machine Learning Modeling on Text Message Data


1. Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

2. Import data

In [2]:
data = pd.read_csv("spam.csv", encoding='latin-1')  # the file is not UTF-8 encoded; latin-1 avoids decode errors
In [3]:
data.head()
Out[3]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 ham Go until jurong point, crazy.. Available only ... NaN NaN NaN
1 ham Ok lar... Joking wif u oni... NaN NaN NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN NaN NaN
3 ham U dun say so early hor... U c already then say... NaN NaN NaN
4 ham Nah I don't think he goes to usf, he lives aro... NaN NaN NaN
Let's drop the unwanted columns and rename the remaining ones appropriately.
In [4]:
#Drop column and name change
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"text"})
In [5]:
data.tail()
Out[5]:
label text
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will Ì_ b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name
In [6]:
#Count observations in each label
data.label.value_counts()
Out[6]:
ham     4825
spam     747
Name: label, dtype: int64
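Note the class imbalance: with 4825 ham and 747 spam messages (5572 total), always predicting ham is already correct 4825/5572 ≈ 86.6% of the time, so plain accuracy is a weak benchmark for this problem.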
In [7]:
# convert label to a numerical variable
data['label_num'] = data.label.map({'ham':0, 'spam':1})
In [8]:
data.head()
Out[8]:
label text label_num
0 ham Go until jurong point, crazy.. Available only ... 0
1 ham Ok lar... Joking wif u oni... 0
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1
3 ham U dun say so early hor... U c already then say... 0
4 ham Nah I don't think he goes to usf, he lives aro... 0

3. Train Test Split

Before performing any text transformation, let us do a train/test split. In fact, we could perform k-fold cross-validation here; for simplicity, I am doing a single train/test split (a k-fold sketch follows the split below).
In [9]:
from sklearn.model_selection import train_test_split
In [10]:
X_train, X_test, y_train, y_test = train_test_split(data["text"], data["label"], test_size=0.2, random_state=10)
In [11]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(4457,)
(1115,)
(4457,)
(1115,)
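As mentioned above, k-fold cross-validation is the more thorough option. Here is a minimal sketch of what the splitting would look like; StratifiedKFold is an assumed choice (it keeps the ham/spam ratio similar in every fold), and this post continues with the single split instead.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
for fold, (train_idx, test_idx) in enumerate(skf.split(data["text"], data["label"])):
    # each fold holds out a different ~20% of the data for testing
    print("Fold %d: %d train / %d test" % (fold, len(train_idx), len(test_idx)))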

4. Text Transformation

Various text transformations such as stop word removal, lowercasing, TF-IDF weighting, and vocabulary pruning can be performed with the sklearn.feature_extraction module (stemming requires an external tool such as NLTK's stemmers). The text can then be converted into a bag-of-words representation.

For this problem, let us see how our model performs without removing stop words; the sketch below shows how the other options would be set.
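For reference only, here is how the techniques listed above map onto vectorizer parameters. This is a sketch with illustrative values (vect_example is a hypothetical name), not the configuration used in this post.
from sklearn.feature_extraction.text import CountVectorizer

vect_example = CountVectorizer(lowercase=True,        # lowercase the texts (the default)
                               stop_words='english',  # built-in English stop word removal
                               min_df=2,              # prune terms seen in fewer than 2 messages
                               max_df=0.95)           # prune terms seen in over 95% of messages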
In [12]:
from sklearn.feature_extraction.text import CountVectorizer
In [13]:
vect = CountVectorizer()
Note: we could also apply a TF-IDF transformation, as sketched below.
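A minimal sketch of that alternative: TfidfVectorizer exposes the same interface as CountVectorizer but applies TF-IDF weighting instead of raw counts. It is not used in the rest of this post.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)  # fit on training data only, then tfidf.transform(X_test)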
In [14]:
vect.fit(X_train)
Out[14]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
The vect.fit call learns the vocabulary. We can get all the feature names from vect.get_feature_names().

Let us print the first and last twenty features.
In [15]:
print(vect.get_feature_names()[0:20])
print(vect.get_feature_names()[-20:])
['00', '000', '000pes', '008704050406', '0089', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703']
['zyada', 'åð', 'åòharry', 'åòit', 'åômorrow', 'åôrents', 'ì_', 'ì¼1', 'ìä', 'ìï', 'ó_', 'û_', 'û_thanks', 'ûªm', 'ûªt', 'ûªve', 'ûï', 'ûïharry', 'ûò', 'ûówell']
In [16]:
X_train_df = vect.transform(X_train)
Now, let's transform the test data.
In [17]:
X_test_df = vect.transform(X_test)
In [18]:
type(X_test_df)
Out[18]:
scipy.sparse.csr.csr_matrix
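The transformed data is a SciPy CSR sparse matrix, since almost every entry of a document-term matrix is zero. A quick way to inspect it (a sketch; the exact numbers depend on the split):
print(X_train_df.shape)  # (number of training messages, vocabulary size)
print(X_train_df.nnz)    # non-zero entries actually stored
print(X_train_df.nnz / float(np.prod(X_train_df.shape)))  # density: fraction of entries that are non-zero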

5. Visualisations

In [19]:
ham_words = ''
spam_words = ''
spam = data[data.label_num == 1]
ham = data[data.label_num == 0]
In [20]:
import nltk
from nltk.corpus import stopwords
# nltk.download('punkt')      # needed once for nltk.word_tokenize below
# nltk.download('stopwords')  # needed once for the (optional) stop word filtering
In [21]:
for val in spam.text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    #tokens = [word for word in tokens if word not in stopwords.words('english')]
    for words in tokens:
        spam_words = spam_words + words + ' '
        
for val in ham.text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        ham_words = ham_words + words + ' '
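As an aside, concatenating strings inside a loop re-copies the growing string on every step; an equivalent one-pass construction (a sketch, assuming the same tokenization) uses join:
# build each word-cloud string in a single pass
spam_words = ' '.join(word for val in spam.text
                      for word in nltk.word_tokenize(val.lower()))
ham_words = ' '.join(word for val in ham.text
                     for word in nltk.word_tokenize(val.lower()))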
In [22]:
from wordcloud import WordCloud
In [23]:
# Generate a word cloud image
spam_wordcloud = WordCloud(width=600, height=400).generate(spam_words)
ham_wordcloud = WordCloud(width=600, height=400).generate(ham_words)
In [24]:
#Spam Word cloud
plt.figure(figsize=(10, 8), facecolor='k')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()