Tuesday, June 21, 2016

How to list word occurrences using CountVectorizer from scikit-learn

A simple and efficient way to get the total number of occurrences of each word in a corpus is to use CountVectorizer from scikit-learn.

Getting back from a column index to the word is not immediately obvious. Here's how to do it:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus: replace with your own iterable of strings
docs = ["first document text", "second document text"]

count_vect = CountVectorizer()
# fit_transform returns a scipy.sparse CSR matrix. Summing it over axis 0 gives a
# 1 x n_features matrix, which is why we need np.asarray(...).ravel() below.
doc_counts = count_vect.fit_transform(docs)

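# get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
word_counts = zip(count_vect.get_feature_names_out(), np.asarray(doc_counts.sum(axis=0)).ravel())
word_counts = sorted(word_counts, key=lambda pair: pair[1], reverse=True)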
 

# Display top 100 words by frequency
word_counts[:100]
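
If you only need the index-to-word lookup, or want true document frequency (how many documents contain a word) rather than total occurrences, something like the following may help. It is a minimal sketch: the toy corpus and the names index_to_word and doc_freq are made up for illustration, and it relies on the fitted vectorizer's vocabulary_ attribute and on the binary=True option of CountVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]

count_vect = CountVectorizer()
doc_counts = count_vect.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to go from index -> word
index_to_word = {idx: word for word, idx in count_vect.vocabulary_.items()}

# Total occurrences per column, flattened to a plain 1-D array
totals = np.asarray(doc_counts.sum(axis=0)).ravel()
top_idx = int(totals.argmax())
top_word = index_to_word[top_idx]

# binary=True records presence/absence, so column sums become document frequencies
binary_vect = CountVectorizer(binary=True)
presence = binary_vect.fit_transform(docs)
doc_totals = np.asarray(presence.sum(axis=0)).ravel()
doc_freq = {word: int(doc_totals[idx]) for word, idx in binary_vect.vocabulary_.items()}

print(top_word, int(totals[top_idx]), doc_freq[top_word])
```

Note that the default tokenizer drops single-character tokens, so "a" never makes it into the vocabulary here.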

Thursday, June 9, 2016

How to 7zip each file separately in a directory (Windows)

Let's zip up all those log files sitting in that directory into separate 7z files. Run this from the command prompt inside the directory (in a .bat file, write %%i instead of %i):

FOR %i IN (*.*) DO "C:\Program Files\7-Zip\7z.exe" a -mx=9 "%i.7z" "%i"
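
For anyone who would rather script this, here is a rough Python equivalent of the one-liner above. It is only a sketch under a few assumptions: the helper names build_commands and compress_each are invented for illustration, the 7z.exe path is the one used in the command above, and unlike the FOR loop it skips files that already end in .7z so a second run does not re-compress the archives.

```python
import subprocess
from pathlib import Path

# Path taken from the command above; adjust if 7-Zip is installed elsewhere
SEVEN_ZIP = r"C:\Program Files\7-Zip\7z.exe"


def build_commands(directory, seven_zip=SEVEN_ZIP):
    """Return one 7z argument list per regular file in `directory`."""
    commands = []
    for path in sorted(Path(directory).iterdir()):
        # Skip sub-directories and existing archives (an assumption, not in the FOR loop)
        if path.is_file() and path.suffix.lower() != ".7z":
            # "a" adds to an archive, -mx=9 asks for the highest compression level
            commands.append([seven_zip, "a", "-mx=9", str(path) + ".7z", str(path)])
    return commands


def compress_each(directory):
    """Run 7z once per file, stopping at the first failure."""
    for cmd in build_commands(directory):
        subprocess.run(cmd, check=True)
```

Calling compress_each on your log directory should then produce one .7z file next to each original file.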

Refs:
http://superuser.com/questions/312652/how-do-i-create-seperate-7z-files-from-each-selected-directory-with-7zip-command
http://askubuntu.com/questions/491223/7z-ultra-settings-for-zip-format