a) Description
of my source code [Python] :
In this project, I build a simple information retrieval system using vector space model. I have used the popular IR data set (cranfield document collection). This project supports up to three search keywords.
Python version : 3.5
This file removes all the SGML tags from the files, and pass
the remaining words to the add_map() function, where the processed words add to
the hashmap
Input : Each corpus data with SGML tags
2) def add_map(data) :
It does the text transformation and adds all the data words
to the hashmap.
Input : Each File data contents
Output : Hashmap contains token,count
3) def
remove_stopwords_from_map(word_map,stop_words):
This function
removes the stop words from the hashmap.
Input : word_map
: Hashmap of all the tokens
Input:
stop_words : List of all the stop words.
4) def
build_inverted_index(word_map, all_files):
This function
takes all the file names, and the word hashmap as an input. This function
process all the documents, one by one, and builds an inverted index for each
word in the documents. Once, inverted index is ready, my program asks the user
to input the search query. Then the
search query is passed to the search_query(key_words) function
Input:
word_map -
Hashmap contains token after removing stop words
all_files : all
file names(1400 files)
5) def
top_words(word_map):
This function
sorts the hashmap, based on the value, and it returns the top 10 values as a
rank. It also prints the number of related documents based on the query to the
user.
6) def
generate_tfidf(all_files , position_map):
This function takes the top 1000 most frequent words and
generate the tf-idf matrix for 1400 files. It returns tfidf matrix.
7) def
transform_query_tfidf(query,position_map):
It takes the query and index position of top 1000 words as
an input, and it transform it to tfidf values.
8) def generate_results_t2(tfidf, query_tfidf):
This function takes tfidf matrix of 1400 documents, and
query_tfidf as an input, and it calculates the cosine similarity between the
query against all those 1400 documents. Then, it arranges the ranking based on
the highest cosine similarity first.
9) def generate_results_t3(tfidf, query_tfidf,
rank_map ):
This function takes tfidf matrix, query_tfidf values, and
rank_map, which is a index map of only query words and it’s documents. It
improves the search faster.
10) def
process_results(query, tfidf):
This function calls T2 and T3, then it process the results
to display the top 10 search query, and it calculates the execution time over
10 times, and it displays the result to the user.
User Interactions:
In my program, users are allowed to give input the following
two things
1)
Users can pass the corpus folder name
2)
Users can enter their search query
Libraries used:
I did not used any external IR library for my work. However,
I used the following default python libraries to process system manipulations
and calculations work.
1.
os – reading files from a folder
2.
time – calculate the running time of a function
call
3.
re – regular expression to remove sgml tags
4.
math – calculate log , sqrt values for cosine
similarity and tfidf transformations.
b)
The size of vector space model
and its ratio in comparison to the size of the original corpus:
I dumped my Vector space model to a text file. It shows that
it’s size as 5.2 mb on disk .
The size of original corpus (1400 files) is 5.8 mb on disk
(Mac computer) . So, it’s ratio is 5.2/5.8
However, size on disk depends on various parameters.
c) Results
Query 1: “ clear survey duration”
T2 Results
*************
Rank-1: cranfield 143 : 0.43292292527837434
Rank-2: cranfield 898 : 0.37523281454087926
Rank-3: cranfield 892 : 0.2817757017318109
Rank-4: cranfield 378 : 0.27174246466052293
Rank-5: cranfield 632 : 0.2705221582864765
Rank-6: cranfield 1170 : 0.2389120540160786
Rank-7: cranfield 93 : 0.23774471738798286
Rank-8: cranfield 1310 : 0.20799827962656017
Rank-9: cranfield 743 : 0.16245456307538253
Rank-10: cranfield 136 : 0.14231114695855476
---------------
Total number of related documents 21
---------------
0.4031250476837158
0.3755300045013428
0.37288594245910645
0.3897981643676758
0.37694787979125977
0.3704710006713867
0.3843810558319092
0.37485289573669434
0.37755298614501953
0.3985710144042969
Average Execution time T2 0.3824115991592407
T3 Results
*************
Rank-1: cranfield 143 : 0.43292292527837434
Rank-2: cranfield 898 : 0.37523281454087926
Rank-3: cranfield 892 : 0.2817757017318109
Rank-4: cranfield 378 : 0.27174246466052293
Rank-5: cranfield 632 : 0.2705221582864765
Rank-6: cranfield 1170 : 0.2389120540160786
Rank-7: cranfield 93 : 0.23774471738798286
Rank-8: cranfield 1310 : 0.20799827962656017
Rank-9: cranfield 743 : 0.16245456307538253
Rank-10: cranfield 136 : 0.14231114695855476
---------------
Total number of related documents 21
---------------
0.01220703125
0.01158285140991211
0.011934995651245117
0.012199878692626953
0.012449979782104492
0.01160883903503418
0.014822959899902344
0.012479066848754883
0.012227058410644531
0.011275053024291992
Average Execution time T3 0.01227877140045166
Query 2: “equation simple
conditions”
please
enter your search query...equation simple conditions
T2
Results
*************
Rank-1:
cranfield 356 : 0.29379601216527984
Rank-2:
cranfield 637 : 0.27391995352001974
Rank-3:
cranfield 987 : 0.27027337255457623
Rank-4:
cranfield 936 : 0.2640905516879408
Rank-5:
cranfield 115 : 0.2636739088301488
Rank-6:
cranfield 956 : 0.24073135894787517
Rank-7:
cranfield 190 : 0.23818227820192306
Rank-8:
cranfield 377 : 0.23435833714672133
Rank-9:
cranfield 931 : 0.23223202349984512
Rank-10:
cranfield 32 : 0.23210751978721522
---------------
Total
number of related documents 531
---------------
0.4186129570007324
0.3894839286804199
0.39535999298095703
0.38993120193481445
0.4131910800933838
0.38707709312438965
0.388077974319458
0.3898200988769531
0.38815903663635254
0.46410679817199707
Average
Execution time T2 0.4023820161819458
T3
Results
*************
Rank-1:
cranfield 356 : 0.29379601216527984
Rank-2:
cranfield 637 : 0.27391995352001974
Rank-3:
cranfield 987 : 0.27027337255457623
Rank-4:
cranfield 936 : 0.2640905516879408
Rank-5:
cranfield 115 : 0.2636739088301488
Rank-6:
cranfield 956 : 0.24073135894787517
Rank-7:
cranfield 190 : 0.23818227820192306
Rank-8:
cranfield 377 : 0.23435833714672133
Rank-9:
cranfield 931 : 0.23223202349984512
Rank-10:
cranfield 32 : 0.23210751978721522
---------------
Total
number of related documents 531
---------------
0.2026369571685791
0.1967639923095703
0.18050193786621094
0.18079710006713867
0.18939900398254395
0.1811840534210205
0.1868269443511963
0.21831607818603516
0.23627495765686035
0.2157609462738037
Average
Execution time T3 0.1988461971282959
Query 3:flow method
analysis
please
enter your search query...flow method analysis
T2
Results
*************
Rank-1:
cranfield 984 : 0.37638717676065275
Rank-2:
cranfield 1384 : 0.2572367282399113
Rank-3:
cranfield 1087 : 0.23959621484856067
Rank-4:
cranfield 1386 : 0.22563192832186038
Rank-5:
cranfield 756 : 0.21819473990519855
Rank-6:
cranfield 264 : 0.21758000127858096
Rank-7:
cranfield 990 : 0.21485385151484018
Rank-8:
cranfield 648 : 0.20419704910258707
Rank-9:
cranfield 1367 : 0.19711086189585206
Rank-10:
cranfield 775 : 0.18921080840594046
---------------
Total
number of related documents 920
---------------
0.42834019660949707
0.3946080207824707
0.4228990077972412
0.4098498821258545
0.4058959484100342
0.41594696044921875
0.4036121368408203
0.42835092544555664
0.40869593620300293
0.4020969867706299
Average
Execution time T2 0.4120296001434326
T3
Results
*************
Rank-1:
cranfield 984 : 0.37638717676065275
Rank-2:
cranfield 1384 : 0.2572367282399113
Rank-3:
cranfield 1087 : 0.23959621484856067
Rank-4:
cranfield 1386 : 0.22563192832186038
Rank-5:
cranfield 756 : 0.21819473990519855
Rank-6:
cranfield 264 : 0.21758000127858096
Rank-7:
cranfield 990 : 0.21485385151484018
Rank-8:
cranfield 648 : 0.20419704910258707
Rank-9:
cranfield 1367 : 0.19711086189585206
Rank-10:
cranfield 775 : 0.18921080840594046
---------------
Total
number of related documents 920
---------------
0.3422510623931885
0.32555580139160156
0.3550689220428467
0.377838134765625
0.3330559730529785
0.33521103858947754
0.3388819694519043
0.3463459014892578
0.3318021297454834
0.3388700485229492
Average
Execution time T3 0.34248809814453124
Query 4 : “governing
equations”
please
enter your search query...governing equations
T2
Results
*************
Rank-1:
cranfield 180 : 0.44383181762947965
Rank-2:
cranfield 298 : 0.32698407600175483
Rank-3:
cranfield 737 : 0.22822596231604902
Rank-4:
cranfield 639 : 0.2197615048669502
Rank-5:
cranfield 1240 : 0.21309867438483113
Rank-6:
cranfield 374 : 0.2069559601367719
Rank-7:
cranfield 362 : 0.2064477945868797
Rank-8:
cranfield 590 : 0.19683594983620756
Rank-9:
cranfield 1245 : 0.17012834868656784
Rank-10:
cranfield 1233 : 0.15756079656165212
---------------
Total
number of related documents 294
---------------
0.4385688304901123
0.4071469306945801
0.39668798446655273
0.4025897979736328
0.40557384490966797
0.42664504051208496
0.405059814453125
0.41585803031921387
0.39579200744628906
0.40870118141174316
Average
Execution time T2 0.4102623462677002
T3
Results
*************
Rank-1:
cranfield 180 : 0.44383181762947965
Rank-2:
cranfield 298 : 0.32698407600175483
Rank-3:
cranfield 737 : 0.22822596231604902
Rank-4:
cranfield 639 : 0.2197615048669502
Rank-5:
cranfield 1240 : 0.21309867438483113
Rank-6:
cranfield 374 : 0.2069559601367719
Rank-7:
cranfield 362 : 0.2064477945868797
Rank-8:
cranfield 590 : 0.19683594983620756
Rank-9:
cranfield 1245 : 0.17012834868656784
Rank-10:
cranfield 1233 : 0.15756079656165212
---------------
Total
number of related documents 294
---------------
0.11075496673583984
0.12650394439697266
0.1074061393737793
0.10495901107788086
0.10561299324035645
0.10079002380371094
0.10472583770751953
0.10688400268554688
0.10442399978637695
0.10543012619018555
Average
Execution time T3 0.1077491044998169
Do
you want to continue : Y/N ?....
Query 5: “heated element”
Enter your search query ... heated element
please
enter your search query...heated element
T2
Results
*************
Rank-1:
cranfield 8 : 0.4665384859565601
Rank-2:
cranfield 13 : 0.4542617554466098
Rank-3:
cranfield 96 : 0.43106985174883966
Rank-4:
cranfield 135 : 0.38677609642392513
Rank-5:
cranfield 835 : 0.36480935735760384
Rank-6:
cranfield 1268 : 0.2892231040067274
Rank-7:
cranfield 7 : 0.2866660361642518
Rank-8:
cranfield 122 : 0.227111193479825
Rank-9:
cranfield 509 : 0.2221391675801983
Rank-10:
cranfield 1294 : 0.20467402309728328
---------------
Total
number of related documents 38
---------------
0.40689706802368164
0.3818209171295166
0.4345400333404541
0.4057309627532959
0.3835599422454834
0.3773460388183594
0.38951706886291504
0.37604308128356934
0.4043550491333008
0.38384199142456055
Average
Execution time T2 0.3943652153015137
T3
Results
*************
Rank-1:
cranfield 8 : 0.4665384859565601
Rank-2:
cranfield 13 : 0.4542617554466098
Rank-3:
cranfield 96 : 0.43106985174883966
Rank-4:
cranfield 135 : 0.38677609642392513
Rank-5:
cranfield 835 : 0.36480935735760384
Rank-6:
cranfield 1268 : 0.2892231040067274
Rank-7:
cranfield 7 : 0.2866660361642518
Rank-8:
cranfield 122 : 0.227111193479825
Rank-9:
cranfield 509 : 0.2221391675801983
Rank-10:
cranfield 1294 : 0.20467402309728328
---------------
Total
number of related documents 38
---------------
0.01381993293762207
0.013023138046264648
0.013969898223876953
0.01398611068725586
0.013659000396728516
0.014817953109741211
0.01394510269165039
0.013763904571533203
0.013024091720581055
0.014100074768066406
Average
Execution time T3 0.01381092071533203
Do
you want to continue : Y/N ?....
Query 6: “layer
assumptions”
please
enter your search query...layer assumptions
T2
Results
*************
Rank-1:
cranfield 499 : 0.2558707588253873
Rank-2:
cranfield 1031 : 0.23910277856273968
Rank-3:
cranfield 539 : 0.20548199178724574
Rank-4:
cranfield 489 : 0.1969338609390281
Rank-5:
cranfield 327 : 0.18164902013108386
Rank-6:
cranfield 928 : 0.17669426223053705
Rank-7:
cranfield 195 : 0.17534591466254595
Rank-8:
cranfield 306 : 0.17361580327828752
Rank-9:
cranfield 449 : 0.17225757455472393
Rank-10:
cranfield 1320 : 0.17148058214878717
---------------
Total
number of related documents 414
---------------
0.4137279987335205
0.4083528518676758
0.39133715629577637
0.4147059917449951
0.40894293785095215
0.39607882499694824
0.3984508514404297
0.39461398124694824
0.4341270923614502
0.4154331684112549
Average
Execution time T2 0.4075770854949951
T3
Results
*************
Rank-1:
cranfield 499 : 0.2558707588253873
Rank-2:
cranfield 1031 : 0.23910277856273968
Rank-3:
cranfield 539 : 0.20548199178724574
Rank-4:
cranfield 489 : 0.1969338609390281
Rank-5:
cranfield 327 : 0.18164902013108386
Rank-6:
cranfield 928 : 0.17669426223053705
Rank-7:
cranfield 195 : 0.17534591466254595
Rank-8:
cranfield 306 : 0.17361580327828752
Rank-9:
cranfield 449 : 0.17225757455472393
Rank-10:
cranfield 1320 : 0.17148058214878717
---------------
Total
number of related documents 414
---------------
0.16916179656982422
0.15564298629760742
0.18361997604370117
0.20829010009765625
0.16690707206726074
0.15019893646240234
0.1474289894104004
0.17515802383422852
0.15147113800048828
0.14196300506591797
Average
Execution time T3 0.16498420238494874
Do
you want to continue : Y/N ?....
Query 7 : “flow”
please
enter your search query...flow
T2
Results
*************
Rank-1:
cranfield 775 : 0.3036240609023336
Rank-2:
cranfield 97 : 0.27541148045561853
Rank-3:
cranfield 404 : 0.24792646750217778
Rank-4:
cranfield 1245 : 0.22095557580089653
Rank-5:
cranfield 1081 : 0.21840296451586416
Rank-6:
cranfield 984 : 0.21675925669921514
Rank-7:
cranfield 242 : 0.21405770165762125
Rank-8:
cranfield 660 : 0.21266127263494033
Rank-9:
cranfield 148 : 0.20723166010960398
Rank-10:
cranfield 310 : 0.19908518802871253
---------------
Total
number of related documents 647
---------------
0.4213838577270508
0.41815805435180664
0.390963077545166
0.40816402435302734
0.42156410217285156
0.4006001949310303
0.41673994064331055
0.3968667984008789
0.4012148380279541
0.40119004249572754
Average
Execution time T2 0.40768449306488036
T3
Results
*************
Rank-1:
cranfield 775 : 0.3036240609023336
Rank-2:
cranfield 97 : 0.27541148045561853
Rank-3:
cranfield 404 : 0.24792646750217778
Rank-4:
cranfield 1245 : 0.22095557580089653
Rank-5:
cranfield 1081 : 0.21840296451586416
Rank-6:
cranfield 984 : 0.21675925669921514
Rank-7:
cranfield 242 : 0.21405770165762125
Rank-8:
cranfield 660 : 0.21266127263494033
Rank-9:
cranfield 148 : 0.20723166010960398
Rank-10:
cranfield 310 : 0.19908518802871253
---------------
Total
number of related documents 647
---------------
0.24351000785827637
0.22838091850280762
0.24750018119812012
0.23785090446472168
0.24008584022521973
0.24260497093200684
0.2255392074584961
0.23802900314331055
0.22886109352111816
0.23388910293579102
Average
Execution time T3 0.2366251230239868
Do
you want to continue : Y/N ?....
Query 8 : “method”
please
enter your search query...method
T2
Results
*************
Rank-1:
cranfield 1087 : 0.41190345286899754
Rank-2:
cranfield 264 : 0.37405413043997754
Rank-3:
cranfield 470 : 0.3097213507355888
Rank-4:
cranfield 461 : 0.29951792586315257
Rank-5:
cranfield 756 : 0.28289621182424124
Rank-6:
cranfield 501 : 0.27889702369296543
Rank-7:
cranfield 435 : 0.25485561764277387
Rank-8:
cranfield 454 : 0.2520442738949897
Rank-9:
cranfield 273 : 0.25022294433632775
Rank-10:
cranfield 91 : 0.25000842178924537
---------------
Total
number of related documents 362
---------------
0.42433905601501465
0.4105088710784912
0.4029409885406494
0.39601802825927734
0.4147911071777344
0.4005770683288574
0.40909910202026367
0.40198397636413574
0.4067959785461426
0.420957088470459
Average
Execution time T2 0.40880112648010253
T3
Results
*************
Rank-1:
cranfield 1087 : 0.41190345286899754
Rank-2:
cranfield 264 : 0.37405413043997754
Rank-3:
cranfield 470 : 0.3097213507355888
Rank-4:
cranfield 461 : 0.29951792586315257
Rank-5:
cranfield 756 : 0.28289621182424124
Rank-6:
cranfield 501 : 0.27889702369296543
Rank-7:
cranfield 435 : 0.25485561764277387
Rank-8:
cranfield 454 : 0.2520442738949897
Rank-9:
cranfield 273 : 0.25022294433632775
Rank-10:
cranfield 91 : 0.25000842178924537
---------------
Total
number of related documents 362
---------------
0.14066386222839355
0.12710094451904297
0.12989091873168945
0.1403520107269287
0.13263797760009766
0.13820815086364746
0.12533903121948242
0.12954497337341309
0.1296861171722412
0.1407911777496338
Average
Execution time T3 0.13342151641845704
Do
you want to continue : Y/N ?....
Query 9 : “arun”
Search query not found user
defined exception
Key takeaway:
Query
|
Avg T2 Time
|
Avg T3 Time
|
clear survey duration
|
0.3824115991592407
|
0.01227877140045166
|
equation simple conditions
|
0.4023820161819458
|
0.1988461971282959
|
flow method analysis
|
0.4120296001434326
|
0.34248809814453124
|
governing equations
|
0.4102623462677002
|
0.1077491044998169
|
heated elements
|
0.3943652153015137
|
0.01381092071533203
|
layer assumptions
|
0.4075770854949951
|
0.16498420238494874
|
flow
|
0.40768449306488036
|
0.2366251230239868
|
method
|
0.40880112648010253
|
0.13342151641845704
|
By seeing the above table, it’s very clear that indexing is really
powerful, and it makes our search faster.
No comments :
Post a Comment