NLP- Multi Class Text Classification

Multi Class Text Classification-

Natural Language Processing

Multi Class Text Classification

In this problem I have used Term document Matrix. The classifier used is SVM and the kernel used is linear. Also a performance comparison of SVD+ SVM and NMF+ SVM can be analyzed by seeing the cross validation accuracy. All the machine learning library used can be found in scikit learn

Note: The performance can be improved by using TFIDF

In [38]:
import numpy as np

Load the Data

In [1]:
Trfile = open('train-data.txt', 'r+')
Vafile = open('dev-data.txt','r+')
Tefile = open('task_2_test_set_to_release.txt','r+')

Read the Data

In [2]:
Datafile = Trfile.read()
Validdata = Vafile.read()
Testfile = Tefile.read()
In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Data Preprocessing

In [20]:
def get_data(Datafile,PLAY):
    
    TList = []
    Datapre =  Datafile.split('\n')
    for ele in Datapre :
        Textlist= ele.split('\t')
        TList.append(Textlist)

    if (PLAY =="TRAIN"):
        PatientNo = []
        PatientName = []
        PatientID = []
        TrainLabel = []
        Text = []
        for i in range(0,len(TList)-1):
            PatientNo.append(TList[i][0])
            PatientName.append(TList[i][1])
            PatientID.append(TList[i][2])
            TrainLabel.append(TList[i][3])
            Text.append(TList[i][4])
            
   
        return PatientNo, PatientName,PatientID,TrainLabel,Text
    
    if (PLAY =="TEST"):
        PatientID =[]
        Text = []
        for i in range(0,len(TList)-1):
            PatientID.append(TList[i][0])
            Text.append(TList[i][1])
        return PatientID,Text
In [21]:
PatientNo, PatientName,PatientID,Trainlabel,Text = get_data(Datafile,PLAY = "TRAIN")
Valid_PatientNo, Valid_PatientName,Valid_PatientID,Valid_Trainlabel,Valid_Text = get_data(Validdata,PLAY = "TRAIN")
Test_PatientID,Test_Text = get_data(Testfile,PLAY = "TEST")

Building the Vocabulary

In [22]:
def vocabularymat(TEXTFILES,VOC,PLAY):
    
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    voc = CountVectorizer()
    voc.fit(VOC)
    if (PLAY == "TRAIN"):
        TrainMat = voc.transform(TEXTFILES)
        return TrainMat
    
    if (PLAY =="TEST"):
        TestMat = voc.transform(TEXTFILES)
        return TestMat
In [28]:
TRAINDATA = vocabularymat( Text,Text+Valid_Text+Test_Text,PLAY="TRAIN")
TESTDATA = vocabularymat(Test_Text,Text+Valid_Text+Test_Text,PLAY="TEST")
In [29]:
TESTDATA
Out[29]:
<7513x12679 sparse matrix of type '<type 'numpy.int64'>'
 with 102853 stored elements in Compressed Sparse Row format>
In [30]:
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.svm import SVC

CROSS VALIDATION

In [31]:
def CROSSVALIDATION(TRAINDATA,Trainlabel,Maxnocomp,step,randomstatemax,METHOD):
    
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import TruncatedSVD, NMF
    from sklearn.model_selection import cross_val_score
    from sklearn import svm
    from sklearn.svm import SVC
    
    if (METHOD =="SVD"):
        Value = 0
        AvgAcc = 0
        
        for j in range(10,Maxnocomp,step):    
            svd = TruncatedSVD(n_components=j, n_iter=7, random_state=42)
            SVD_Matrix = svd.fit_transform(TRAINDATA)  
            for i in range(0,randomstatemax):
            
                X_train, X_test, y_train, y_test = train_test_split(SVD_Matrix,Trainlabel, test_size=0.1, random_state=i)
                clf = SVC(kernel='linear')
                clf.fit(X_train, y_train)
                Value = clf.score(X_test,y_test)*100
                AvgAcc = AvgAcc + Value
                print '-----No of Components:',j,'-----Random_State:',i
                print 'Accuracy in %', Value    
            print'#############'    
            print '--AVERAGE ACCURACY',AvgAcc/10
            AvgAcc = 0
                
    
    if (METHOD == "NMF"):
        Value = 0
        AvgAcc = 0
        for j in range(10,Maxnocomp,step):    
            MODEL = NMF(n_components=j, init='random', random_state=0)
            W= MODEL.fit_transform(TRAINDATA) 
            
            for i in range(0,randomstatemax):
            
                X_train, X_test, y_train, y_test = train_test_split(W,Trainlabel, test_size=0.1, random_state=i)
                clf = SVC(kernel='linear')
                clf.fit(X_train, y_train)
                Value = clf.score(X_test, y_test)*100
                AvgAcc = AvgAcc + Value
                print '-----No of Components:',j,'-----Random_State:',i
                print 'Accuracy in %', Value    
            print'#############'    
            print '--AVERAGE ACCURACY',AvgAcc/10
            AvgAcc = 0
In [32]:
CROSSVALIDATION(TRAINDATA,Trainlabel,300,20,10,METHOD="SVD")
-----No of Components: 10 -----Random_State: 0
Accuracy in % 52.3364485981
-----No of Components: 10 -----Random_State: 1
Accuracy in % 60.7476635514
-----No of Components: 10 -----Random_State: 2
Accuracy in % 56.0747663551
-----No of Components: 10 -----Random_State: 3
Accuracy in % 53.2710280374
-----No of Components: 10 -----Random_State: 4
Accuracy in % 62.6168224299
-----No of Components: 10 -----Random_State: 5
Accuracy in % 47.6635514019
-----No of Components: 10 -----Random_State: 6
Accuracy in % 55.1401869159
-----No of Components: 10 -----Random_State: 7
Accuracy in % 55.1401869159
-----No of Components: 10 -----Random_State: 8
Accuracy in % 49.5327102804
-----No of Components: 10 -----Random_State: 9
Accuracy in % 48.5981308411
#############
--AVERAGE ACCURACY 54.1121495327
-----No of Components: 30 -----Random_State: 0
Accuracy in % 63.5514018692
-----No of Components: 30 -----Random_State: 1
Accuracy in % 67.2897196262
-----No of Components: 30 -----Random_State: 2
Accuracy in % 60.7476635514
-----No of Components: 30 -----Random_State: 3
Accuracy in % 57.0093457944
-----No of Components: 30 -----Random_State: 4
Accuracy in % 67.2897196262
-----No of Components: 30 -----Random_State: 5
Accuracy in % 64.4859813084
-----No of Components: 30 -----Random_State: 6
Accuracy in % 60.7476635514
-----No of Components: 30 -----Random_State: 7
Accuracy in % 58.8785046729
-----No of Components: 30 -----Random_State: 8
Accuracy in % 63.5514018692
-----No of Components: 30 -----Random_State: 9
Accuracy in % 62.6168224299
#############
--AVERAGE ACCURACY 62.6168224299
-----No of Components: 50 -----Random_State: 0
Accuracy in % 57.0093457944
-----No of Components: 50 -----Random_State: 1
Accuracy in % 68.2242990654
-----No of Components: 50 -----Random_State: 2
Accuracy in % 65.4205607477
-----No of Components: 50 -----Random_State: 3
Accuracy in % 55.1401869159
-----No of Components: 50 -----Random_State: 4
Accuracy in % 65.4205607477
-----No of Components: 50 -----Random_State: 5
Accuracy in % 63.5514018692
-----No of Components: 50 -----Random_State: 6
Accuracy in % 61.6822429907
-----No of Components: 50 -----Random_State: 7
Accuracy in % 61.6822429907
-----No of Components: 50 -----Random_State: 8
Accuracy in % 66.3551401869
-----No of Components: 50 -----Random_State: 9
Accuracy in % 68.2242990654
#############
--AVERAGE ACCURACY 63.2710280374
-----No of Components: 70 -----Random_State: 0
Accuracy in % 57.0093457944
-----No of Components: 70 -----Random_State: 1
Accuracy in % 62.6168224299
-----No of Components: 70 -----Random_State: 2
Accuracy in % 61.6822429907
-----No of Components: 70 -----Random_State: 3
Accuracy in % 54.2056074766
-----No of Components: 70 -----Random_State: 4
Accuracy in % 62.6168224299
-----No of Components: 70 -----Random_State: 5
Accuracy in % 57.0093457944
-----No of Components: 70 -----Random_State: 6
Accuracy in % 61.6822429907
-----No of Components: 70 -----Random_State: 7
Accuracy in % 60.7476635514
-----No of Components: 70 -----Random_State: 8
Accuracy in % 67.2897196262
-----No of Components: 70 -----Random_State: 9
Accuracy in % 69.1588785047
#############
--AVERAGE ACCURACY 61.4018691589
-----No of Components: 90 -----Random_State: 0
Accuracy in % 59.8130841121
-----No of Components: 90 -----Random_State: 1
Accuracy in % 65.4205607477
-----No of Components: 90 -----Random_State: 2
Accuracy in % 66.3551401869
-----No of Components: 90 -----Random_State: 3
Accuracy in % 54.2056074766
-----No of Components: 90 -----Random_State: 4
Accuracy in % 66.3551401869
-----No of Components: 90 -----Random_State: 5
Accuracy in % 60.7476635514
-----No of Components: 90 -----Random_State: 6
Accuracy in % 61.6822429907
-----No of Components: 90 -----Random_State: 7
Accuracy in % 58.8785046729
-----No of Components: 90 -----Random_State: 8
Accuracy in % 66.3551401869
-----No of Components: 90 -----Random_State: 9
Accuracy in % 73.8317757009
#############
--AVERAGE ACCURACY 63.3644859813
-----No of Components: 110 -----Random_State: 0
Accuracy in % 59.8130841121
-----No of Components: 110 -----Random_State: 1
Accuracy in % 66.3551401869
-----No of Components: 110 -----Random_State: 2
Accuracy in % 62.6168224299
-----No of Components: 110 -----Random_State: 3
Accuracy in % 58.8785046729
-----No of Components: 110 -----Random_State: 4
Accuracy in % 66.3551401869
-----No of Components: 110 -----Random_State: 5
Accuracy in % 57.9439252336
-----No of Components: 110 -----Random_State: 6
Accuracy in % 65.4205607477
-----No of Components: 110 -----Random_State: 7
Accuracy in % 57.9439252336
-----No of Components: 110 -----Random_State: 8
Accuracy in % 63.5514018692
-----No of Components: 110 -----Random_State: 9
Accuracy in % 73.8317757009
#############
--AVERAGE ACCURACY 63.2710280374
-----No of Components: 130 -----Random_State: 0
Accuracy in % 63.5514018692
-----No of Components: 130 -----Random_State: 1
Accuracy in % 64.4859813084
-----No of Components: 130 -----Random_State: 2
Accuracy in % 65.4205607477
-----No of Components: 130 -----Random_State: 3
Accuracy in % 58.8785046729
-----No of Components: 130 -----Random_State: 4
Accuracy in % 66.3551401869
-----No of Components: 130 -----Random_State: 5
Accuracy in % 58.8785046729
-----No of Components: 130 -----Random_State: 6
Accuracy in % 64.4859813084
-----No of Components: 130 -----Random_State: 7
Accuracy in % 57.9439252336
-----No of Components: 130 -----Random_State: 8
Accuracy in % 66.3551401869
-----No of Components: 130 -----Random_State: 9
Accuracy in % 72.8971962617
#############
--AVERAGE ACCURACY 63.9252336449
-----No of Components: 150 -----Random_State: 0
Accuracy in % 64.4859813084
-----No of Components: 150 -----Random_State: 1
Accuracy in % 70.0934579439
-----No of Components: 150 -----Random_State: 2
Accuracy in % 67.2897196262
-----No of Components: 150 -----Random_State: 3
Accuracy in % 61.6822429907
-----No of Components: 150 -----Random_State: 4
Accuracy in % 66.3551401869
-----No of Components: 150 -----Random_State: 5
Accuracy in % 62.6168224299
-----No of Components: 150 -----Random_State: 6
Accuracy in % 65.4205607477
-----No of Components: 150 -----Random_State: 7
Accuracy in % 62.6168224299
-----No of Components: 150 -----Random_State: 8
Accuracy in % 64.4859813084
-----No of Components: 150 -----Random_State: 9
Accuracy in % 71.9626168224
#############
--AVERAGE ACCURACY 65.7009345794
-----No of Components: 170 -----Random_State: 0
Accuracy in % 57.9439252336
-----No of Components: 170 -----Random_State: 1
Accuracy in % 64.4859813084
-----No of Components: 170 -----Random_State: 2
Accuracy in % 70.0934579439
-----No of Components: 170 -----Random_State: 3
Accuracy in % 59.8130841121
-----No of Components: 170 -----Random_State: 4
Accuracy in % 65.4205607477
-----No of Components: 170 -----Random_State: 5
Accuracy in % 64.4859813084
-----No of Components: 170 -----Random_State: 6
Accuracy in % 65.4205607477
-----No of Components: 170 -----Random_State: 7
Accuracy in % 62.6168224299
-----No of Components: 170 -----Random_State: 8
Accuracy in % 70.0934579439
-----No of Components: 170 -----Random_State: 9
Accuracy in % 75.7009345794
#############
--AVERAGE ACCURACY 65.6074766355
-----No of Components: 190 -----Random_State: 0
Accuracy in % 63.5514018692
-----No of Components: 190 -----Random_State: 1
Accuracy in % 64.4859813084
-----No of Components: 190 -----Random_State: 2
Accuracy in % 71.0280373832
-----No of Components: 190 -----Random_State: 3
Accuracy in % 57.0093457944
-----No of Components: 190 -----Random_State: 4
Accuracy in % 63.5514018692
-----No of Components: 190 -----Random_State: 5
Accuracy in % 60.7476635514
-----No of Components: 190 -----Random_State: 6
Accuracy in % 69.1588785047
-----No of Components: 190 -----Random_State: 7
Accuracy in % 62.6168224299
-----No of Components: 190 -----Random_State: 8
Accuracy in % 64.4859813084
-----No of Components: 190 -----Random_State: 9
Accuracy in % 72.8971962617
#############
--AVERAGE ACCURACY 64.953271028
-----No of Components: 210 -----Random_State: 0
Accuracy in % 68.2242990654
-----No of Components: 210 -----Random_State: 1
Accuracy in % 68.2242990654
-----No of Components: 210 -----Random_State: 2
Accuracy in % 66.3551401869
-----No of Components: 210 -----Random_State: 3
Accuracy in % 58.8785046729
-----No of Components: 210 -----Random_State: 4
Accuracy in % 64.4859813084
-----No of Components: 210 -----Random_State: 5
Accuracy in % 62.6168224299
-----No of Components: 210 -----Random_State: 6
Accuracy in % 68.2242990654
-----No of Components: 210 -----Random_State: 7
Accuracy in % 68.2242990654
-----No of Components: 210 -----Random_State: 8
Accuracy in % 71.9626168224
-----No of Components: 210 -----Random_State: 9
Accuracy in % 71.9626168224
#############
--AVERAGE ACCURACY 66.9158878505
-----No of Components: 230 -----Random_State: 0
Accuracy in % 62.6168224299
-----No of Components: 230 -----Random_State: 1
Accuracy in % 65.4205607477
-----No of Components: 230 -----Random_State: 2
Accuracy in % 68.2242990654
-----No of Components: 230 -----Random_State: 3
Accuracy in % 61.6822429907
-----No of Components: 230 -----Random_State: 4
Accuracy in % 62.6168224299
-----No of Components: 230 -----Random_State: 5
Accuracy in % 62.6168224299
-----No of Components: 230 -----Random_State: 6
Accuracy in % 65.4205607477
-----No of Components: 230 -----Random_State: 7
Accuracy in % 72.8971962617
-----No of Components: 230 -----Random_State: 8
Accuracy in % 67.2897196262
-----No of Components: 230 -----Random_State: 9
Accuracy in % 74.7663551402
#############
--AVERAGE ACCURACY 66.3551401869
-----No of Components: 250 -----Random_State: 0
Accuracy in % 62.6168224299
-----No of Components: 250 -----Random_State: 1
Accuracy in % 62.6168224299
-----No of Components: 250 -----Random_State: 2
Accuracy in % 67.2897196262
-----No of Components: 250 -----Random_State: 3
Accuracy in % 57.0093457944
-----No of Components: 250 -----Random_State: 4
Accuracy in % 61.6822429907
-----No of Components: 250 -----Random_State: 5
Accuracy in % 56.0747663551
-----No of Components: 250 -----Random_State: 6
Accuracy in % 66.3551401869
-----No of Components: 250 -----Random_State: 7
Accuracy in % 67.2897196262
-----No of Components: 250 -----Random_State: 8
Accuracy in % 65.4205607477
-----No of Components: 250 -----Random_State: 9
Accuracy in % 71.0280373832
#############
--AVERAGE ACCURACY 63.738317757
-----No of Components: 270 -----Random_State: 0
Accuracy in % 64.4859813084
-----No of Components: 270 -----Random_State: 1
Accuracy in % 57.0093457944
-----No of Components: 270 -----Random_State: 2
Accuracy in % 64.4859813084
-----No of Components: 270 -----Random_State: 3
Accuracy in % 56.0747663551
-----No of Components: 270 -----Random_State: 4
Accuracy in % 64.4859813084
-----No of Components: 270 -----Random_State: 5
Accuracy in % 63.5514018692
-----No of Components: 270 -----Random_State: 6
Accuracy in % 65.4205607477
-----No of Components: 270 -----Random_State: 7
Accuracy in % 71.0280373832
-----No of Components: 270 -----Random_State: 8
Accuracy in % 66.3551401869
-----No of Components: 270 -----Random_State: 9
Accuracy in % 73.8317757009
#############
--AVERAGE ACCURACY 64.6728971963
-----No of Components: 290 -----Random_State: 0
Accuracy in % 64.4859813084
-----No of Components: 290 -----Random_State: 1
Accuracy in % 60.7476635514
-----No of Components: 290 -----Random_State: 2
Accuracy in % 66.3551401869
-----No of Components: 290 -----Random_State: 3
Accuracy in % 60.7476635514
-----No of Components: 290 -----Random_State: 4
Accuracy in % 60.7476635514
-----No of Components: 290 -----Random_State: 5
Accuracy in % 67.2897196262
-----No of Components: 290 -----Random_State: 6
Accuracy in % 66.3551401869
-----No of Components: 290 -----Random_State: 7
Accuracy in % 67.2897196262
-----No of Components: 290 -----Random_State: 8
Accuracy in % 64.4859813084
-----No of Components: 290 -----Random_State: 9
Accuracy in % 71.0280373832
#############
--AVERAGE ACCURACY 64.953271028

Choosing the model based on the hyperparameter obtained after cross validation

In [33]:
svd = TruncatedSVD(n_components=210, n_iter=7, random_state=42)
SVD_Matrix = svd.fit_transform(TRAINDATA)  

clf = SVC(kernel='linear')
clf.fit(SVD_Matrix, Trainlabel)
Out[33]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Training Accuracy

In [34]:
print 'Accuracy in %', clf.score(SVD_Matrix, Trainlabel)*100
Accuracy in % 85.3521126761

To Save the Model(SVD+SVM)

In [39]:
import pickle
from sklearn.externals import joblib
joblib.dump(clf, 'model_svd_svm_nlp.pkl') 
Out[39]:
['model_svd_svm_nlp.pkl']

Load the model

In [40]:
model = joblib.load('model_svd_svm_nlp.pkl') 

Testing with the trained Model

In [41]:
svd = TruncatedSVD(n_components=210, n_iter=7, random_state=42)
SVD_Matrix_Test = svd.fit_transform(TESTDATA) 
#print SVD_Matrix.shape
#print SVD_Matrix_Test.shape
predicted = model.predict(SVD_Matrix_Test)
predicted = predicted.astype(int)
np.savetxt('predictedlabel_svd_svm_nlp.txt',predicted)

To Know the count of the data in each class

In [ ]:
PredictedList=predicted.tolist()
print PredictedList.count(1)
print PredictedList.count(2)
print PredictedList.count(3)

2nd Method NMF + SVM

In [43]:
CROSSVALIDATION(TRAINDATA,Trainlabel,300,20,10,METHOD="NMF")
-----No of Components: 10 -----Random_State: 0
Accuracy in % 45.7943925234
-----No of Components: 10 -----Random_State: 1
Accuracy in % 50.4672897196
-----No of Components: 10 -----Random_State: 2
Accuracy in % 46.7289719626
-----No of Components: 10 -----Random_State: 3
Accuracy in % 53.2710280374
-----No of Components: 10 -----Random_State: 4
Accuracy in % 44.8598130841
-----No of Components: 10 -----Random_State: 5
Accuracy in % 50.4672897196
-----No of Components: 10 -----Random_State: 6
Accuracy in % 44.8598130841
-----No of Components: 10 -----Random_State: 7
Accuracy in % 44.8598130841
-----No of Components: 10 -----Random_State: 8
Accuracy in % 47.6635514019
-----No of Components: 10 -----Random_State: 9
Accuracy in % 46.7289719626
#############
--AVERAGE ACCURACY 47.5700934579
-----No of Components: 30 -----Random_State: 0
Accuracy in % 46.7289719626
-----No of Components: 30 -----Random_State: 1
Accuracy in % 51.4018691589
-----No of Components: 30 -----Random_State: 2
Accuracy in % 49.5327102804
-----No of Components: 30 -----Random_State: 3
Accuracy in % 55.1401869159
-----No of Components: 30 -----Random_State: 4
Accuracy in % 44.8598130841
-----No of Components: 30 -----Random_State: 5
Accuracy in % 52.3364485981
-----No of Components: 30 -----Random_State: 6
Accuracy in % 44.8598130841
-----No of Components: 30 -----Random_State: 7
Accuracy in % 43.9252336449
-----No of Components: 30 -----Random_State: 8
Accuracy in % 48.5981308411
-----No of Components: 30 -----Random_State: 9
Accuracy in % 48.5981308411
#############
--AVERAGE ACCURACY 48.5981308411
-----No of Components: 50 -----Random_State: 0
Accuracy in % 48.5981308411
-----No of Components: 50 -----Random_State: 1
Accuracy in % 56.0747663551
-----No of Components: 50 -----Random_State: 2
Accuracy in % 49.5327102804
-----No of Components: 50 -----Random_State: 3
Accuracy in % 57.9439252336
-----No of Components: 50 -----Random_State: 4
Accuracy in % 48.5981308411
-----No of Components: 50 -----Random_State: 5
Accuracy in % 51.4018691589
-----No of Components: 50 -----Random_State: 6
Accuracy in % 45.7943925234
-----No of Components: 50 -----Random_State: 7
Accuracy in % 44.8598130841
-----No of Components: 50 -----Random_State: 8
Accuracy in % 48.5981308411
-----No of Components: 50 -----Random_State: 9
Accuracy in % 47.6635514019
#############
--AVERAGE ACCURACY 49.9065420561
-----No of Components: 70 -----Random_State: 0
Accuracy in % 51.4018691589
-----No of Components: 70 -----Random_State: 1
Accuracy in % 57.9439252336
-----No of Components: 70 -----Random_State: 2
Accuracy in % 53.2710280374
-----No of Components: 70 -----Random_State: 3
Accuracy in % 58.8785046729
-----No of Components: 70 -----Random_State: 4
Accuracy in % 51.4018691589
-----No of Components: 70 -----Random_State: 5
Accuracy in % 53.2710280374
-----No of Components: 70 -----Random_State: 6
Accuracy in % 46.7289719626
-----No of Components: 70 -----Random_State: 7
Accuracy in % 50.4672897196
-----No of Components: 70 -----Random_State: 8
Accuracy in % 53.2710280374
-----No of Components: 70 -----Random_State: 9
Accuracy in % 49.5327102804
#############
--AVERAGE ACCURACY 52.6168224299
-----No of Components: 90 -----Random_State: 0
Accuracy in % 51.4018691589
-----No of Components: 90 -----Random_State: 1
Accuracy in % 57.9439252336
-----No of Components: 90 -----Random_State: 2
Accuracy in % 51.4018691589
-----No of Components: 90 -----Random_State: 3
Accuracy in % 58.8785046729
-----No of Components: 90 -----Random_State: 4
Accuracy in % 52.3364485981
-----No of Components: 90 -----Random_State: 5
Accuracy in % 53.2710280374
-----No of Components: 90 -----Random_State: 6
Accuracy in % 42.9906542056
-----No of Components: 90 -----Random_State: 7
Accuracy in % 49.5327102804
-----No of Components: 90 -----Random_State: 8
Accuracy in % 54.2056074766
-----No of Components: 90 -----Random_State: 9
Accuracy in % 46.7289719626
#############
--AVERAGE ACCURACY 51.8691588785
-----No of Components: 110 -----Random_State: 0
Accuracy in % 55.1401869159
-----No of Components: 110 -----Random_State: 1
Accuracy in % 62.6168224299
-----No of Components: 110 -----Random_State: 2
Accuracy in % 53.2710280374
-----No of Components: 110 -----Random_State: 3
Accuracy in % 59.8130841121
-----No of Components: 110 -----Random_State: 4
Accuracy in % 54.2056074766
-----No of Components: 110 -----Random_State: 5
Accuracy in % 55.1401869159
-----No of Components: 110 -----Random_State: 6
Accuracy in % 52.3364485981
-----No of Components: 110 -----Random_State: 7
Accuracy in % 55.1401869159
-----No of Components: 110 -----Random_State: 8
Accuracy in % 57.0093457944
-----No of Components: 110 -----Random_State: 9
Accuracy in % 51.4018691589
#############
--AVERAGE ACCURACY 55.6074766355
-----No of Components: 130 -----Random_State: 0
Accuracy in % 50.4672897196
-----No of Components: 130 -----Random_State: 1
Accuracy in % 57.9439252336
-----No of Components: 130 -----Random_State: 2
Accuracy in % 52.3364485981
-----No of Components: 130 -----Random_State: 3
Accuracy in % 58.8785046729
-----No of Components: 130 -----Random_State: 4
Accuracy in % 55.1401869159
-----No of Components: 130 -----Random_State: 5
Accuracy in % 55.1401869159
-----No of Components: 130 -----Random_State: 6
Accuracy in % 44.8598130841
-----No of Components: 130 -----Random_State: 7
Accuracy in % 51.4018691589
-----No of Components: 130 -----Random_State: 8
Accuracy in % 51.4018691589
-----No of Components: 130 -----Random_State: 9
Accuracy in % 58.8785046729
#############
--AVERAGE ACCURACY 53.6448598131
-----No of Components: 150 -----Random_State: 0
Accuracy in % 48.5981308411
-----No of Components: 150 -----Random_State: 1
Accuracy in % 56.0747663551
-----No of Components: 150 -----Random_State: 2
Accuracy in % 48.5981308411
-----No of Components: 150 -----Random_State: 3
Accuracy in % 57.0093457944
-----No of Components: 150 -----Random_State: 4
Accuracy in % 47.6635514019
-----No of Components: 150 -----Random_State: 5
Accuracy in % 52.3364485981
-----No of Components: 150 -----Random_State: 6
Accuracy in % 45.7943925234
-----No of Components: 150 -----Random_State: 7
Accuracy in % 46.7289719626
-----No of Components: 150 -----Random_State: 8
Accuracy in % 48.5981308411
-----No of Components: 150 -----Random_State: 9
Accuracy in % 52.3364485981
#############
--AVERAGE ACCURACY 50.3738317757
-----No of Components: 170 -----Random_State: 0
Accuracy in % 49.5327102804
-----No of Components: 170 -----Random_State: 1
Accuracy in % 55.1401869159
-----No of Components: 170 -----Random_State: 2
Accuracy in % 50.4672897196
-----No of Components: 170 -----Random_State: 3
Accuracy in % 57.0093457944
-----No of Components: 170 -----Random_State: 4
Accuracy in % 48.5981308411
-----No of Components: 170 -----Random_State: 5
Accuracy in % 52.3364485981
-----No of Components: 170 -----Random_State: 6
Accuracy in % 43.9252336449
-----No of Components: 170 -----Random_State: 7
Accuracy in % 44.8598130841
-----No of Components: 170 -----Random_State: 8
Accuracy in % 49.5327102804
-----No of Components: 170 -----Random_State: 9
Accuracy in % 45.7943925234
#############
--AVERAGE ACCURACY 49.7196261682
-----No of Components: 190 -----Random_State: 0
Accuracy in % 46.7289719626
-----No of Components: 190 -----Random_State: 1
Accuracy in % 54.2056074766
-----No of Components: 190 -----Random_State: 2
Accuracy in % 50.4672897196
-----No of Components: 190 -----Random_State: 3
Accuracy in % 56.0747663551
-----No of Components: 190 -----Random_State: 4
Accuracy in % 48.5981308411
-----No of Components: 190 -----Random_State: 5
Accuracy in % 51.4018691589
-----No of Components: 190 -----Random_State: 6
Accuracy in % 42.9906542056
-----No of Components: 190 -----Random_State: 7
Accuracy in % 46.7289719626
-----No of Components: 190 -----Random_State: 8
Accuracy in % 50.4672897196
-----No of Components: 190 -----Random_State: 9
Accuracy in % 48.5981308411
#############
--AVERAGE ACCURACY 49.6261682243
-----No of Components: 210 -----Random_State: 0
Accuracy in % 49.5327102804
-----No of Components: 210 -----Random_State: 1
Accuracy in % 56.0747663551
-----No of Components: 210 -----Random_State: 2
Accuracy in % 48.5981308411
-----No of Components: 210 -----Random_State: 3
Accuracy in % 57.0093457944
-----No of Components: 210 -----Random_State: 4
Accuracy in % 47.6635514019
-----No of Components: 210 -----Random_State: 5
Accuracy in % 52.3364485981
-----No of Components: 210 -----Random_State: 6
Accuracy in % 47.6635514019
-----No of Components: 210 -----Random_State: 7
Accuracy in % 47.6635514019
-----No of Components: 210 -----Random_State: 8
Accuracy in % 51.4018691589
-----No of Components: 210 -----Random_State: 9
Accuracy in % 53.2710280374
#############
--AVERAGE ACCURACY 51.1214953271
-----No of Components: 230 -----Random_State: 0
Accuracy in % 57.0093457944
-----No of Components: 230 -----Random_State: 1
Accuracy in % 57.0093457944
-----No of Components: 230 -----Random_State: 2
Accuracy in % 55.1401869159
-----No of Components: 230 -----Random_State: 3
Accuracy in % 57.0093457944
-----No of Components: 230 -----Random_State: 4
Accuracy in % 57.9439252336
-----No of Components: 230 -----Random_State: 5
Accuracy in % 56.0747663551
-----No of Components: 230 -----Random_State: 6
Accuracy in % 49.5327102804
-----No of Components: 230 -----Random_State: 7
Accuracy in % 51.4018691589
-----No of Components: 230 -----Random_State: 8
Accuracy in % 51.4018691589
-----No of Components: 230 -----Random_State: 9
Accuracy in % 57.9439252336
#############
--AVERAGE ACCURACY 55.046728972
-----No of Components: 250 -----Random_State: 0
Accuracy in % 56.0747663551
-----No of Components: 250 -----Random_State: 1
Accuracy in % 57.9439252336
-----No of Components: 250 -----Random_State: 2
Accuracy in % 54.2056074766
-----No of Components: 250 -----Random_State: 3
Accuracy in % 58.8785046729
-----No of Components: 250 -----Random_State: 4
Accuracy in % 58.8785046729
-----No of Components: 250 -----Random_State: 5
Accuracy in % 53.2710280374
-----No of Components: 250 -----Random_State: 6
Accuracy in % 49.5327102804
-----No of Components: 250 -----Random_State: 7
Accuracy in % 52.3364485981
-----No of Components: 250 -----Random_State: 8
Accuracy in % 51.4018691589
-----No of Components: 250 -----Random_State: 9
Accuracy in % 58.8785046729
#############
--AVERAGE ACCURACY 55.1401869159
-----No of Components: 270 -----Random_State: 0
Accuracy in % 53.2710280374
-----No of Components: 270 -----Random_State: 1
Accuracy in % 58.8785046729
-----No of Components: 270 -----Random_State: 2
Accuracy in % 48.5981308411
-----No of Components: 270 -----Random_State: 3
Accuracy in % 55.1401869159
-----No of Components: 270 -----Random_State: 4
Accuracy in % 51.4018691589
-----No of Components: 270 -----Random_State: 5
Accuracy in % 53.2710280374
-----No of Components: 270 -----Random_State: 6
Accuracy in % 48.5981308411
-----No of Components: 270 -----Random_State: 7
Accuracy in % 45.7943925234
-----No of Components: 270 -----Random_State: 8
Accuracy in % 52.3364485981
-----No of Components: 270 -----Random_State: 9
Accuracy in % 55.1401869159
#############
--AVERAGE ACCURACY 52.2429906542
-----No of Components: 290 -----Random_State: 0
Accuracy in % 47.6635514019
-----No of Components: 290 -----Random_State: 1
Accuracy in % 57.0093457944
-----No of Components: 290 -----Random_State: 2
Accuracy in % 51.4018691589
-----No of Components: 290 -----Random_State: 3
Accuracy in % 57.9439252336
-----No of Components: 290 -----Random_State: 4
Accuracy in % 49.5327102804
-----No of Components: 290 -----Random_State: 5
Accuracy in % 49.5327102804
-----No of Components: 290 -----Random_State: 6
Accuracy in % 45.7943925234
-----No of Components: 290 -----Random_State: 7
Accuracy in % 46.7289719626
-----No of Components: 290 -----Random_State: 8
Accuracy in % 53.2710280374
-----No of Components: 290 -----Random_State: 9
Accuracy in % 54.2056074766
#############
--AVERAGE ACCURACY 51.308411215

Building the model (From Cross Validation , we fix the value of N_components in NMF)

In [44]:
MODEL = NMF(n_components=250, init='random', random_state=0)
W= MODEL.fit_transform(TRAINDATA) 
clf = SVC(kernel='linear')
clf.fit(W, Trainlabel)
Out[44]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Training Accuracy

In [45]:
print 'Accuracy in %', clf.score(W, Trainlabel)*100 
Accuracy in % 57.1830985915

To Save the model (NMF +SVM)

In [46]:
import pickle
from sklearn.externals import joblib
joblib.dump(clf, 'model_nmf_svm_nlp.pkl') 
Out[46]:
['model_nmf_svm_nlp.pkl']

Load the Model

In [47]:
model = joblib.load('model_nmf_svm_nlp.pkl') 
MODELTEST = NMF(n_components=250, init='random', random_state=0)
WTEST= MODELTEST.fit_transform(TESTDATA)
predictednmf = model.predict(WTEST)
predictednmf = predictednmf.astype(int)
np.savetxt('predictedlabel_nmf_svm_nlp.txt',predictednmf)
In [328]:
PredictedList=predictednmf.tolist()
print PredictedList.count(1)
print PredictedList.count(2)
print PredictedList.count(3)
111
223
7179

Comments