Saturday, December 3, 2016

Basic Information Retrieval/Search using Vector Space Model


a)  Description of my source code [Python] :

In this project, I build a simple information retrieval system using vector space model. I have used the popular IR data set (cranfield document collection). This project supports up to three search keywords.

Python version : 3.5



1)    def process_file(all_files):

This file removes all the SGML tags from the files, and pass the remaining words to the add_map() function, where the processed words add to the hashmap
Input : Each corpus data with SGML tags

2)    def add_map(data) :

It does the text transformation and adds all the data words to the hashmap.
Input : Each File data contents
Output : Hashmap contains token,count

3)    def remove_stopwords_from_map(word_map,stop_words):

This function removes the stop words from the hashmap.
Input : word_map : Hashmap of all the tokens
Input: stop_words : List of all the stop words.

4)    def build_inverted_index(word_map, all_files):
This function takes all the file names, and the word hashmap as an input. This function process all the documents, one by one, and builds an inverted index for each word in the documents. Once, inverted index is ready, my program asks the user to input the search query.  Then the search query is passed to the search_query(key_words) function
Input:
word_map - Hashmap contains token after removing stop words
all_files : all file names(1400 files)

5)    def top_words(word_map):

This function sorts the hashmap, based on the value, and it returns the top 10 values as a rank. It also prints the number of related documents based on the query to the user.

6)    def generate_tfidf(all_files , position_map):

This function takes the top 1000 most frequent words and generate the tf-idf matrix for 1400 files. It returns tfidf matrix.

7)    def transform_query_tfidf(query,position_map):

It takes the query and index position of top 1000 words as an input, and it transform it to tfidf values.
 

8)    def generate_results_t2(tfidf, query_tfidf):

This function takes tfidf matrix of 1400 documents, and query_tfidf as an input, and it calculates the cosine similarity between the query against all those 1400 documents. Then, it arranges the ranking based on the highest cosine similarity first.




9)    def generate_results_t3(tfidf, query_tfidf, rank_map ):

This function takes tfidf matrix, query_tfidf values, and rank_map, which is a index map of only query words and it’s documents. It improves the search faster.

10)  def process_results(query, tfidf):

This function calls T2 and T3, then it process the results to display the top 10 search query, and it calculates the execution time over 10 times, and it displays the result to the user.

User Interactions:

In my program, users are allowed to give input the following two things

1)    Users can pass the corpus folder name
2)    Users can enter their search query

Libraries used:

I did not used any external IR library for my work. However, I used the following default python libraries to process system manipulations and calculations work.

1.     os – reading files from a folder
2.     time – calculate the running time of a function call
3.     re – regular expression to remove sgml tags
4.     math – calculate log , sqrt values for cosine similarity and tfidf transformations.

b) The size of vector space model and its ratio in comparison to the size of the original corpus:

I dumped my Vector space model to a text file. It shows that it’s size as 5.2 mb on disk .
The size of original corpus (1400 files) is 5.8 mb on disk (Mac computer) . So, it’s ratio is 5.2/5.8
However, size on disk depends on various parameters.

c)  Results


Query 1: “ clear survey duration”
T2 Results
*************
Rank-1: cranfield 143 : 0.43292292527837434
Rank-2: cranfield 898 : 0.37523281454087926
Rank-3: cranfield 892 : 0.2817757017318109
Rank-4: cranfield 378 : 0.27174246466052293
Rank-5: cranfield 632 : 0.2705221582864765
Rank-6: cranfield 1170 : 0.2389120540160786
Rank-7: cranfield 93 : 0.23774471738798286
Rank-8: cranfield 1310 : 0.20799827962656017
Rank-9: cranfield 743 : 0.16245456307538253
Rank-10: cranfield 136 : 0.14231114695855476
---------------
Total number of related documents 21
---------------
0.4031250476837158
0.3755300045013428
0.37288594245910645
0.3897981643676758
0.37694787979125977
0.3704710006713867
0.3843810558319092
0.37485289573669434
0.37755298614501953
0.3985710144042969
Average Execution time T2 0.3824115991592407
T3 Results
*************
Rank-1: cranfield 143 : 0.43292292527837434
Rank-2: cranfield 898 : 0.37523281454087926
Rank-3: cranfield 892 : 0.2817757017318109
Rank-4: cranfield 378 : 0.27174246466052293
Rank-5: cranfield 632 : 0.2705221582864765
Rank-6: cranfield 1170 : 0.2389120540160786
Rank-7: cranfield 93 : 0.23774471738798286
Rank-8: cranfield 1310 : 0.20799827962656017
Rank-9: cranfield 743 : 0.16245456307538253
Rank-10: cranfield 136 : 0.14231114695855476
---------------
Total number of related documents 21
---------------
0.01220703125
0.01158285140991211
0.011934995651245117
0.012199878692626953
0.012449979782104492
0.01160883903503418
0.014822959899902344
0.012479066848754883
0.012227058410644531
0.011275053024291992
Average Execution time T3 0.01227877140045166

Query 2: “equation simple conditions”

please enter your search query...equation simple conditions
T2 Results
*************
Rank-1: cranfield 356 : 0.29379601216527984
Rank-2: cranfield 637 : 0.27391995352001974
Rank-3: cranfield 987 : 0.27027337255457623
Rank-4: cranfield 936 : 0.2640905516879408
Rank-5: cranfield 115 : 0.2636739088301488
Rank-6: cranfield 956 : 0.24073135894787517
Rank-7: cranfield 190 : 0.23818227820192306
Rank-8: cranfield 377 : 0.23435833714672133
Rank-9: cranfield 931 : 0.23223202349984512
Rank-10: cranfield 32 : 0.23210751978721522
---------------
Total number of related documents 531
---------------
0.4186129570007324
0.3894839286804199
0.39535999298095703
0.38993120193481445
0.4131910800933838
0.38707709312438965
0.388077974319458
0.3898200988769531
0.38815903663635254
0.46410679817199707
Average Execution time T2 0.4023820161819458
T3 Results
*************
Rank-1: cranfield 356 : 0.29379601216527984
Rank-2: cranfield 637 : 0.27391995352001974
Rank-3: cranfield 987 : 0.27027337255457623
Rank-4: cranfield 936 : 0.2640905516879408
Rank-5: cranfield 115 : 0.2636739088301488
Rank-6: cranfield 956 : 0.24073135894787517
Rank-7: cranfield 190 : 0.23818227820192306
Rank-8: cranfield 377 : 0.23435833714672133
Rank-9: cranfield 931 : 0.23223202349984512
Rank-10: cranfield 32 : 0.23210751978721522
---------------
Total number of related documents 531
---------------
0.2026369571685791
0.1967639923095703
0.18050193786621094
0.18079710006713867
0.18939900398254395
0.1811840534210205
0.1868269443511963
0.21831607818603516
0.23627495765686035
0.2157609462738037
Average Execution time T3 0.1988461971282959

Query 3:flow method analysis

please enter your search query...flow method analysis
T2 Results
*************
Rank-1: cranfield 984 : 0.37638717676065275
Rank-2: cranfield 1384 : 0.2572367282399113
Rank-3: cranfield 1087 : 0.23959621484856067
Rank-4: cranfield 1386 : 0.22563192832186038
Rank-5: cranfield 756 : 0.21819473990519855
Rank-6: cranfield 264 : 0.21758000127858096
Rank-7: cranfield 990 : 0.21485385151484018
Rank-8: cranfield 648 : 0.20419704910258707
Rank-9: cranfield 1367 : 0.19711086189585206
Rank-10: cranfield 775 : 0.18921080840594046
---------------
Total number of related documents 920
---------------
0.42834019660949707
0.3946080207824707
0.4228990077972412
0.4098498821258545
0.4058959484100342
0.41594696044921875
0.4036121368408203
0.42835092544555664
0.40869593620300293
0.4020969867706299
Average Execution time T2 0.4120296001434326
T3 Results
*************
Rank-1: cranfield 984 : 0.37638717676065275
Rank-2: cranfield 1384 : 0.2572367282399113
Rank-3: cranfield 1087 : 0.23959621484856067
Rank-4: cranfield 1386 : 0.22563192832186038
Rank-5: cranfield 756 : 0.21819473990519855
Rank-6: cranfield 264 : 0.21758000127858096
Rank-7: cranfield 990 : 0.21485385151484018
Rank-8: cranfield 648 : 0.20419704910258707
Rank-9: cranfield 1367 : 0.19711086189585206
Rank-10: cranfield 775 : 0.18921080840594046
---------------
Total number of related documents 920
---------------
0.3422510623931885
0.32555580139160156
0.3550689220428467
0.377838134765625
0.3330559730529785
0.33521103858947754
0.3388819694519043
0.3463459014892578
0.3318021297454834
0.3388700485229492
Average Execution time T3 0.34248809814453124

Query 4 : “governing equations”

please enter your search query...governing equations
T2 Results
*************
Rank-1: cranfield 180 : 0.44383181762947965
Rank-2: cranfield 298 : 0.32698407600175483
Rank-3: cranfield 737 : 0.22822596231604902
Rank-4: cranfield 639 : 0.2197615048669502
Rank-5: cranfield 1240 : 0.21309867438483113
Rank-6: cranfield 374 : 0.2069559601367719
Rank-7: cranfield 362 : 0.2064477945868797
Rank-8: cranfield 590 : 0.19683594983620756
Rank-9: cranfield 1245 : 0.17012834868656784
Rank-10: cranfield 1233 : 0.15756079656165212
---------------
Total number of related documents 294
---------------
0.4385688304901123
0.4071469306945801
0.39668798446655273
0.4025897979736328
0.40557384490966797
0.42664504051208496
0.405059814453125
0.41585803031921387
0.39579200744628906
0.40870118141174316
Average Execution time T2 0.4102623462677002
T3 Results
*************
Rank-1: cranfield 180 : 0.44383181762947965
Rank-2: cranfield 298 : 0.32698407600175483
Rank-3: cranfield 737 : 0.22822596231604902
Rank-4: cranfield 639 : 0.2197615048669502
Rank-5: cranfield 1240 : 0.21309867438483113
Rank-6: cranfield 374 : 0.2069559601367719
Rank-7: cranfield 362 : 0.2064477945868797
Rank-8: cranfield 590 : 0.19683594983620756
Rank-9: cranfield 1245 : 0.17012834868656784
Rank-10: cranfield 1233 : 0.15756079656165212
---------------
Total number of related documents 294
---------------
0.11075496673583984
0.12650394439697266
0.1074061393737793
0.10495901107788086
0.10561299324035645
0.10079002380371094
0.10472583770751953
0.10688400268554688
0.10442399978637695
0.10543012619018555
Average Execution time T3 0.1077491044998169
Do you want to continue : Y/N ?....

Query 5: “heated element”

Enter your search query ... heated element
please enter your search query...heated element
T2 Results
*************
Rank-1: cranfield 8 : 0.4665384859565601
Rank-2: cranfield 13 : 0.4542617554466098
Rank-3: cranfield 96 : 0.43106985174883966
Rank-4: cranfield 135 : 0.38677609642392513
Rank-5: cranfield 835 : 0.36480935735760384
Rank-6: cranfield 1268 : 0.2892231040067274
Rank-7: cranfield 7 : 0.2866660361642518
Rank-8: cranfield 122 : 0.227111193479825
Rank-9: cranfield 509 : 0.2221391675801983
Rank-10: cranfield 1294 : 0.20467402309728328
---------------
Total number of related documents 38
---------------
0.40689706802368164
0.3818209171295166
0.4345400333404541
0.4057309627532959
0.3835599422454834
0.3773460388183594
0.38951706886291504
0.37604308128356934
0.4043550491333008
0.38384199142456055
Average Execution time T2 0.3943652153015137
T3 Results
*************
Rank-1: cranfield 8 : 0.4665384859565601
Rank-2: cranfield 13 : 0.4542617554466098
Rank-3: cranfield 96 : 0.43106985174883966
Rank-4: cranfield 135 : 0.38677609642392513
Rank-5: cranfield 835 : 0.36480935735760384
Rank-6: cranfield 1268 : 0.2892231040067274
Rank-7: cranfield 7 : 0.2866660361642518
Rank-8: cranfield 122 : 0.227111193479825
Rank-9: cranfield 509 : 0.2221391675801983
Rank-10: cranfield 1294 : 0.20467402309728328
---------------
Total number of related documents 38
---------------
0.01381993293762207
0.013023138046264648
0.013969898223876953
0.01398611068725586
0.013659000396728516
0.014817953109741211
0.01394510269165039
0.013763904571533203
0.013024091720581055
0.014100074768066406
Average Execution time T3 0.01381092071533203
Do you want to continue : Y/N ?....

Query 6: “layer assumptions”

please enter your search query...layer assumptions
T2 Results
*************
Rank-1: cranfield 499 : 0.2558707588253873
Rank-2: cranfield 1031 : 0.23910277856273968
Rank-3: cranfield 539 : 0.20548199178724574
Rank-4: cranfield 489 : 0.1969338609390281
Rank-5: cranfield 327 : 0.18164902013108386
Rank-6: cranfield 928 : 0.17669426223053705
Rank-7: cranfield 195 : 0.17534591466254595
Rank-8: cranfield 306 : 0.17361580327828752
Rank-9: cranfield 449 : 0.17225757455472393
Rank-10: cranfield 1320 : 0.17148058214878717
---------------
Total number of related documents 414
---------------
0.4137279987335205
0.4083528518676758
0.39133715629577637
0.4147059917449951
0.40894293785095215
0.39607882499694824
0.3984508514404297
0.39461398124694824
0.4341270923614502
0.4154331684112549
Average Execution time T2 0.4075770854949951
T3 Results
*************
Rank-1: cranfield 499 : 0.2558707588253873
Rank-2: cranfield 1031 : 0.23910277856273968
Rank-3: cranfield 539 : 0.20548199178724574
Rank-4: cranfield 489 : 0.1969338609390281
Rank-5: cranfield 327 : 0.18164902013108386
Rank-6: cranfield 928 : 0.17669426223053705
Rank-7: cranfield 195 : 0.17534591466254595
Rank-8: cranfield 306 : 0.17361580327828752
Rank-9: cranfield 449 : 0.17225757455472393
Rank-10: cranfield 1320 : 0.17148058214878717
---------------
Total number of related documents 414
---------------
0.16916179656982422
0.15564298629760742
0.18361997604370117
0.20829010009765625
0.16690707206726074
0.15019893646240234
0.1474289894104004
0.17515802383422852
0.15147113800048828
0.14196300506591797
Average Execution time T3 0.16498420238494874
Do you want to continue : Y/N ?....

Query 7 : “flow”

please enter your search query...flow
T2 Results
*************
Rank-1: cranfield 775 : 0.3036240609023336
Rank-2: cranfield 97 : 0.27541148045561853
Rank-3: cranfield 404 : 0.24792646750217778
Rank-4: cranfield 1245 : 0.22095557580089653
Rank-5: cranfield 1081 : 0.21840296451586416
Rank-6: cranfield 984 : 0.21675925669921514
Rank-7: cranfield 242 : 0.21405770165762125
Rank-8: cranfield 660 : 0.21266127263494033
Rank-9: cranfield 148 : 0.20723166010960398
Rank-10: cranfield 310 : 0.19908518802871253
---------------
Total number of related documents 647
---------------
0.4213838577270508
0.41815805435180664
0.390963077545166
0.40816402435302734
0.42156410217285156
0.4006001949310303
0.41673994064331055
0.3968667984008789
0.4012148380279541
0.40119004249572754
Average Execution time T2 0.40768449306488036
T3 Results
*************
Rank-1: cranfield 775 : 0.3036240609023336
Rank-2: cranfield 97 : 0.27541148045561853
Rank-3: cranfield 404 : 0.24792646750217778
Rank-4: cranfield 1245 : 0.22095557580089653
Rank-5: cranfield 1081 : 0.21840296451586416
Rank-6: cranfield 984 : 0.21675925669921514
Rank-7: cranfield 242 : 0.21405770165762125
Rank-8: cranfield 660 : 0.21266127263494033
Rank-9: cranfield 148 : 0.20723166010960398
Rank-10: cranfield 310 : 0.19908518802871253
---------------
Total number of related documents 647
---------------
0.24351000785827637
0.22838091850280762
0.24750018119812012
0.23785090446472168
0.24008584022521973
0.24260497093200684
0.2255392074584961
0.23802900314331055
0.22886109352111816
0.23388910293579102
Average Execution time T3 0.2366251230239868
Do you want to continue : Y/N ?....

Query 8 : “method”
please enter your search query...method
T2 Results
*************
Rank-1: cranfield 1087 : 0.41190345286899754
Rank-2: cranfield 264 : 0.37405413043997754
Rank-3: cranfield 470 : 0.3097213507355888
Rank-4: cranfield 461 : 0.29951792586315257
Rank-5: cranfield 756 : 0.28289621182424124
Rank-6: cranfield 501 : 0.27889702369296543
Rank-7: cranfield 435 : 0.25485561764277387
Rank-8: cranfield 454 : 0.2520442738949897
Rank-9: cranfield 273 : 0.25022294433632775
Rank-10: cranfield 91 : 0.25000842178924537
---------------
Total number of related documents 362
---------------
0.42433905601501465
0.4105088710784912
0.4029409885406494
0.39601802825927734
0.4147911071777344
0.4005770683288574
0.40909910202026367
0.40198397636413574
0.4067959785461426
0.420957088470459
Average Execution time T2 0.40880112648010253
T3 Results
*************
Rank-1: cranfield 1087 : 0.41190345286899754
Rank-2: cranfield 264 : 0.37405413043997754
Rank-3: cranfield 470 : 0.3097213507355888
Rank-4: cranfield 461 : 0.29951792586315257
Rank-5: cranfield 756 : 0.28289621182424124
Rank-6: cranfield 501 : 0.27889702369296543
Rank-7: cranfield 435 : 0.25485561764277387
Rank-8: cranfield 454 : 0.2520442738949897
Rank-9: cranfield 273 : 0.25022294433632775
Rank-10: cranfield 91 : 0.25000842178924537
---------------
Total number of related documents 362
---------------
0.14066386222839355
0.12710094451904297
0.12989091873168945
0.1403520107269287
0.13263797760009766
0.13820815086364746
0.12533903121948242
0.12954497337341309
0.1296861171722412
0.1407911777496338
Average Execution time T3 0.13342151641845704
Do you want to continue : Y/N ?....

Query 9 : “arun”
Search query not found user defined exception





Key takeaway:

Query
Avg T2 Time
Avg T3 Time
clear survey duration
0.3824115991592407
0.01227877140045166
equation simple conditions
0.4023820161819458
0.1988461971282959
flow method analysis
0.4120296001434326
0.34248809814453124
governing equations
0.4102623462677002
0.1077491044998169
heated elements
0.3943652153015137
0.01381092071533203
layer assumptions
0.4075770854949951
0.16498420238494874
flow
0.40768449306488036
0.2366251230239868
method
0.40880112648010253
0.13342151641845704


By seeing the above table, it’s very clear that indexing is really powerful, and it makes our search faster.

No comments :

Post a Comment