Similarity search on Wikipedia using gensim in Python.
This project has two goals:
- Create LSI vector representations of all the articles in English Wikipedia using a modified version of the make_wikicorpus.py script in gensim.
- Perform concept searches and other fun text analysis on Wikipedia, also using gensim functionality.
I started with the make_wikicorpus.py script from gensim, and my version produces nearly identical results.
My changes were the following:
- I broke out each of the steps and commented the hell out of them to explain what was going on in each.
- For clarity and simplicity, I removed the "online" mode of operation.
- I modified the script to save out the names of all of the Wikipedia articles as well, so that you could perform searches against the dataset and get the names of the matching articles.
- I added the conversion to LSI step.
I pulled down the latest Wikipedia dump on 1/18/17; here are some statistics on it:
| Count | Description |
|---|---|
| 17,180,273 | Total number of articles (without any filtering) |
| 4,198,780 | Number of articles after filtering out "article redirects" and "short stubs" |
| 2,355,066,808 | Total number of tokens in all articles (without any filtering) |
| 2,292,505,314 | Total number of tokens after filtering articles |
| 8,746,676 | Total number of unique words found in all articles (*after* filtering articles) |
Vectorizing all of Wikipedia is a fairly lengthy process, and the data files are large. Here is what you can expect from each step of the process.
These numbers are from running on my desktop PC, which has an Intel Core i7 4770, 16GB of RAM, and an SSD.
| # | Step | Time (h:m) | Output File | File Size |
|---|------|------------|-------------|-----------|
| 0 | Download Wikipedia Dump | -- | enwiki-latest-pages-articles.xml.bz2 | 12.6 GB |
| 1 | Parse Wikipedia & Build Dictionary | 3:12 | dictionary.txt.bz2 | 769 KB |
| 2 | Convert articles to bag-of-words vectors | 3:32 | bow.mm | 9.44 GB |
| 2a | Store article titles | -- | bow.mm.metadata.cpickle | 152 MB |
| 3 | Learn tf-idf model from document statistics | 0:47 | tfidf.tfidf_model | 4.01 MB |
| 4 | Convert articles to tf-idf | 1:40 | corpus_tfidf.mm | 17.9 GB |
| 5 | Learn LSI model with 300 topics | 2:07 | lsi.lsi_model | 3.46 MB |
|   |      |            | lsi.lsi_model.projection | 3 KB |
|   |      |            | lsi.lsi_model.projection.u.npy | 228 MB |
| 6 | Convert articles to LSI | 0:58 | lsi_index.mm | 1 KB |
|   |      |            | lsi_index.mm.index.npy | 4.69 GB |
|   | **TOTALS** | **12:16** |  | **45 GB** |
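The steps in the table map onto standard gensim calls. Here is a simplified sketch of the pipeline, not the actual code from make_wikicorpus.py; the paths and the dictionary-pruning parameters are assumptions:

```python
from gensim.corpora import MmCorpus, WikiCorpus
from gensim.models import LsiModel, TfidfModel

# 1. Parse the dump and build the dictionary, then prune the ~8.7M unique tokens
#    down to a manageable vocabulary (pruning parameters here are illustrative).
wiki = WikiCorpus('./data/enwiki-latest-pages-articles.xml.bz2')
wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000)
wiki.dictionary.save_as_text('./data/dictionary.txt.bz2')

# 2. / 2a. Convert every article to a bag-of-words vector. With metadata=True, gensim
#    also writes bow.mm.metadata.cpickle, which records the article titles.
wiki.metadata = True
MmCorpus.serialize('./data/bow.mm', wiki, metadata=True)

# 3. / 4. Learn a tf-idf model from the document statistics and apply it to all vectors.
bow = MmCorpus('./data/bow.mm')
tfidf = TfidfModel(bow, id2word=wiki.dictionary)
tfidf.save('./data/tfidf.tfidf_model')
MmCorpus.serialize('./data/corpus_tfidf.mm', tfidf[bow])

# 5. Learn a 300-topic LSI model from the tf-idf vectors.
corpus_tfidf = MmCorpus('./data/corpus_tfidf.mm')
lsi = LsiModel(corpus_tfidf, id2word=wiki.dictionary, num_topics=300)
lsi.save('./data/lsi.lsi_model')

# Step 6 (building the similarity index) is shown in the next snippet.
```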
I recommend converting the LSI vectors directly to a MatrixSimilarity class rather than performing the intermediate step of creating and saving an "LSI corpus". If you do save the LSI corpus first, it takes longer and the resulting file is huge:
| # | Step | Time (h:m) | Output File | File Size |
|---|------|------------|-------------|-----------|
| 6 | Convert articles to LSI and save as MmCorpus | 2:34 | corpus_lsi.mm | 33.2 GB |
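Going straight to the MatrixSimilarity index looks roughly like the following sketch (file names come from the table above; the paths are assumptions):

```python
from gensim.corpora import MmCorpus
from gensim.models import LsiModel
from gensim.similarities import MatrixSimilarity

corpus_tfidf = MmCorpus('./data/corpus_tfidf.mm')
lsi = LsiModel.load('./data/lsi.lsi_model')

# lsi[corpus_tfidf] streams the LSI vectors one document at a time; MatrixSimilarity
# normalizes them and packs them into a single float32 matrix held in RAM.
index = MatrixSimilarity(lsi[corpus_tfidf], num_features=lsi.num_topics)

# Saving produces lsi_index.mm plus the large lsi_index.mm.index.npy matrix file.
index.save('./data/lsi_index.mm')
```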
The final LSI matrix is pretty huge. We have ~4.2M articles with 300 features, and the features are 32-bit (4-byte) floats.
To store this matrix in memory, we need (4.2E6 * 300 * 4) / (2^30) = 4.69GB of RAM!
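You can sanity-check that number with a quick back-of-the-envelope calculation:

```python
num_articles = 4198780   # articles after filtering
num_topics = 300         # LSI features per article
bytes_per_float = 4      # 32-bit floats

print(num_articles * num_topics * bytes_per_float / 2**30)  # ~4.69 GiB
```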
Once the script is done, you can delete bow.mm (9.44 GB), but the rest of the data you'll want to keep for performing searches.
Before running the script, download the latest Wikipedia dump here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Save the dump file in the ./data/ directory of this project.
Then, run make_wikicorpus.py to fully parse Wikipedia and generate the LSI index!
The script enables gensim logging, and saves all the logging to log.txt in the project directory. I've included an example log.txt in the project. You can open this log while the script is running to get more detailed progress updates.
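Gensim reports its progress through Python's standard logging module, so capturing it amounts to something like the following sketch (the exact format string used by the script may differ):

```python
import logging

# Route gensim's (and the script's) log messages to log.txt at INFO level.
logging.basicConfig(
    filename='log.txt',
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)
```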
The script also prints an overview to the console; here is an example output:
Parsing Wikipedia to build Dictionary...
Building dictionary took 3:05
8746676 unique tokens before pruning.
Converting to bag of words...
Conversion to bag-of-words took 3:47
Learning tf-idf model from data...
Building tf-idf model took 0:47
Applying tf-idf model to all vectors...
Applying tf-idf model took 1:40
Learning LSI model from the tf-idf vectors...
Building LSI model took 2:07
Applying LSI model to all vectors...
Applying LSI model took 2:00
Once you have the LSI vectors for Wikipedia, you're ready to perform similarity searches.
The script run_search.py shows a bare-bones approach to performing a similarity search with gensim.
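In essence, a bare-bones search just loads the saved MatrixSimilarity index and the article titles, then compares one article's LSI vector against all of the others. A minimal sketch of that idea (the docno used here is a made-up example; run_search.py may differ in detail):

```python
import pickle
from gensim.similarities import MatrixSimilarity

# Load the precomputed LSI index and the docno -> (pageid, title) map
# written by MmCorpus.serialize(..., metadata=True) (tuple order assumed).
index = MatrixSimilarity.load('./data/lsi_index.mm')
with open('./data/bow.mm.metadata.cpickle', 'rb') as f:
    docno2metadata = pickle.load(f)

query_docno = 1000                       # hypothetical row number of the query article
sims = index[index.index[query_docno]]   # cosine similarity against every article

# Sort all ~4.2M scores and print the ten best matches.
for docno, score in sorted(enumerate(sims), key=lambda x: -x[1])[:10]:
    print('%.2f  %s' % (score, docno2metadata[docno][1]))
```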
Here is the example output:
Loading Wikipedia LSI index (15-30sec.)...
Loading LSI vectors took 13.03 seconds
Loading Wikipedia article titles...
Searching for articles similar to 'Topic model':
Similarity search took 320 ms
Sorting took 8.45 seconds
Results:
Topic model
Online content analysis
Semantic similarity
Information retrieval
Data-oriented parsing
Concept search
Object-role modeling
Software analysis pattern
Content analysis
Adaptive hypermedia
For some more bells and whistles, I've pulled over my SimSearch project.
The SimSearch and KeySearch classes (in simsearch.py and keysearch.py) add a number of features:
- Supply new text as the input to a similarity search (a plain-gensim sketch of this appears after this list).
- Interpret similarity matches by looking at which words contributed most to the similarity.
- Identify top words in clusters of documents.
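The first feature, querying with brand-new text, boils down to pushing that text through the same dictionary → tf-idf → LSI pipeline and handing the result to the index. Here is a plain-gensim sketch of the idea; SimSearch wraps this kind of logic behind its own API, which may differ, and the input file name is made up:

```python
from gensim.corpora import Dictionary
from gensim.models import LsiModel, TfidfModel
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess

dictionary = Dictionary.load_from_text('./data/dictionary.txt.bz2')
tfidf = TfidfModel.load('./data/tfidf.tfidf_model')
lsi = LsiModel.load('./data/lsi.lsi_model')
index = MatrixSimilarity.load('./data/lsi_index.mm')

# Tokenize the new text (simple_preprocess is a stand-in for whatever tokenization was
# used to build the dictionary) and run it through the same bow -> tf-idf -> LSI pipeline.
with open('my_blog_post.md') as f:   # hypothetical input file
    query_bow = dictionary.doc2bow(simple_preprocess(f.read()))

sims = index[lsi[tfidf[query_bow]]]  # cosine similarity against every Wikipedia article
top_matches = sorted(enumerate(sims), key=lambda x: -x[1])[:10]
```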
To see some of these features, look at and run searchWithSimSearch.py.
Example 1 searches for articles similar to the article 'Topic model', and also interprets the top match.
Example output:
Loading Wikipedia article titles
Loading dictionary...
Took 0.81 seconds
Loading tf-idf model...
Took 0.08 seconds
Creating tf-idf corpus object (leaves the vectors on disk)...
Took 0.82 seconds
Loading LSI model...
Took 0.73 seconds
Loading Wikipedia LSI index...
Took 13.21 seconds
Searching for similar articles...
Most similar documents:
0.90 Online content analysis
0.90 Semantic similarity
0.89 Information retrieval
0.89 Data-oriented parsing
0.89 Concept search
0.89 Object-role modeling
0.89 Software analysis pattern
0.88 Content analysis
0.88 Adaptive hypermedia
0.88 Model-driven architecture
Search and sort took 9.59 seconds
Interpreting the match between 'Topic model' and 'Online content analysis' ...
Words in doc 1 which contribute most to similarity:
text +0.065
data +0.059
model +0.053
models +0.043
topic +0.034
modeling +0.031
software +0.028
analysis +0.019
topics +0.019
algorithms +0.014
digital +0.014
words +0.012
example +0.012
document +0.011
information +0.010
language +0.010
social +0.009
matrix +0.008
identify +0.008
semantic +0.008
Words in doc 2 which contribute most to similarity:
analysis +0.070
text +0.067
content +0.054
methods +0.035
algorithm +0.029
research +0.027
online +0.026
models +0.026
data +0.014
researchers +0.014
words +0.013
how +0.013
communication +0.013
sample +0.012
coding +0.009
internet +0.009
web +0.009
categories +0.008
human +0.008
random +0.008
Interpreting match took 0.75 seconds
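Because the LSI projection is linear, a document's LSI vector is just the sum of the projections of its individual tf-idf terms, so each term can be assigned an additive share of the cosine similarity. Here is a sketch of that idea in plain gensim; this is my reading of the technique, and simsearch.py may compute it differently:

```python
import numpy as np
from gensim import matutils

def word_contributions(tfidf_doc1, tfidf_doc2, lsi, dictionary):
    """Return {word: share of cos(doc1, doc2)} for the words in doc 1.

    tfidf_doc1 and tfidf_doc2 are sparse tf-idf vectors, e.g. tfidf[dictionary.doc2bow(tokens)].
    """
    vec1 = matutils.sparse2full(lsi[tfidf_doc1], lsi.num_topics)
    vec2 = matutils.sparse2full(lsi[tfidf_doc2], lsi.num_topics)
    vec2 /= np.linalg.norm(vec2)
    norm1 = np.linalg.norm(vec1)

    contribs = {}
    for term_id, weight in tfidf_doc1:
        # Project this single weighted term into LSI space; by linearity these per-term
        # vectors sum to vec1, so the shares below sum to the cosine similarity.
        term_vec = matutils.sparse2full(lsi[[(term_id, weight)]], lsi.num_topics)
        contribs[dictionary[term_id]] = float(np.dot(term_vec, vec2)) / norm1
    return contribs
```

Sorting the returned dictionary by value and printing the top entries gives word lists like the ones shown above.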
Example 2 demonstrates searching using some new input text as the query. I've included the markdown for a couple of my blog articles as example material for the search.
The script also prints the top 10 words associated with each of the topics, and writes these out to topic_words.txt.
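If the topics in question are the 300 LSI topics, gensim can produce those word lists directly; a hypothetical sketch (searchWithSimSearch.py and keysearch.py may do this differently):

```python
from gensim.models import LsiModel

lsi = LsiModel.load('./data/lsi.lsi_model')

# Write the ten highest-weight words for each LSI topic to topic_words.txt.
with open('topic_words.txt', 'w') as f:
    for topicno in range(lsi.num_topics):
        f.write('Topic %d: %s\n' % (topicno, lsi.print_topic(topicno, topn=10)))
```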