A simple and efficient way to get word frequency counts from a corpus is to use CountVectorizer from scikit-learn.
Getting back from the feature index to the word is not immediately obvious; here's how to do it:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
docs = <load your docs as an iterable>
count_vect = CountVectorizer()
# doc_counts is a scipy.sparse CSR matrix (documents x vocabulary),
# which is why we need .ravel() below to flatten the summed row.
doc_counts = count_vect.fit_transform(docs)
word_counts = zip(count_vect.get_feature_names(), np.asarray(doc_counts.sum(axis=0)).ravel())
word_counts = sorted(word_counts, key=lambda wc: -wc[1])
# Display the top 100 words by frequency
word_counts[:100]
Tuesday, June 21, 2016
Thursday, June 9, 2016
How to 7zip each file separately in a directory (Windows)
Let's zip up all those log files sitting in that directory into separate 7z files:
FOR %i IN (*.*) DO "C:\Program Files\7-Zip\7z.exe" a -mx=9 "%i.7z" "%i"
(If you run this from a batch file rather than the command prompt, double the percent signs: %%i.)
Refs:
http://superuser.com/questions/312652/how-do-i-create-seperate-7z-files-from-each-selected-directory-with-7zip-command
http://askubuntu.com/questions/491223/7z-ultra-settings-for-zip-format