Document Distance: Program Version 5
The get_words_from_string routine is character-oriented: it must do some processing on each character of the input file(s). Thus, although the running time of this routine is linear in the size of the input, it is nonetheless expensive in the end, because there are many more characters in the file than there are words in the file.
One of the nice things about Python is that it has extensive libraries of built-in routines that are efficiently implemented. This is especially true for string-processing routines; the module string, for example, contains many fast and very useful routines. We'll now use some of these to implement a much faster version of get_words_from_string.
Our strategy is simple:
- Using the string.translate routine, convert all punctuation characters to blanks, while simultaneously converting all upper-case letters into lower-case letters. For example:

      string.translate("(Hi) David. What's up?", tab) ==> " hi david what s up "

  when tab is an appropriate "translation table" (see the sketch following this list).
- Using the routine string.split, split the text line into its constituent words. Applying string.split to the previous string yields: ["hi","david","what","s","up"].
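To make the translation table concrete, here is a minimal sketch of both steps, using Python 2's string module (the same module the routine below relies on); tab and the sample text are just the example from above:

```python
import string

# Map every punctuation character to a blank and every upper-case
# letter to its lower-case equivalent (a 256-entry byte table).
tab = string.maketrans(string.punctuation + string.uppercase,
                       " " * len(string.punctuation) + string.lowercase)

text = string.translate("(Hi) David. What's up?", tab)
# translate leaves runs of blanks behind (one per punctuation mark);
# split with no separator argument treats any run of whitespace as a
# single delimiter, so those extra blanks disappear here.
print string.split(text)   # ['hi', 'david', 'what', 's', 'up']
```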
Here is the modified routine:
```python
import string

# global variables needed for fast parsing
# translation table maps upper case to lower case and punctuation to spaces
translation_table = string.maketrans(string.punctuation+string.uppercase,
                                     " "*len(string.punctuation)+string.lowercase)

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings
              (each string is a sequence of alphanumeric characters)
    """
    line = line.translate(translation_table)
    word_list = line.split()
    return word_list
```
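A quick check that the rewritten routine behaves as advertised (a hypothetical interactive session, assuming the definitions above have been loaded):

```python
>>> get_words_from_string("(Hi) David. What's up?")
['hi', 'david', 'what', 's', 'up']
```

Both translate and split walk over the characters in compiled C code, so the per-character work that motivated this change no longer runs as interpreted Python.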
The modified document distance routine is docdist5.py.
Running this on our standard example gives the following output:
```
docdist5.py t2.bobsey.txt t3.lewis.txt
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)

         366048 function calls in 13.859 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
    22663    0.089    0.000    0.089    0.000 :0(extend)
   232140    0.751    0.000    0.751    0.000 :0(has_key)
        2    0.020    0.010    0.020    0.010 :0(items)
    43228    0.143    0.000    0.143    0.000 :0(len)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.000    0.000    0.000    0.000 :0(range)
        2    0.013    0.007    0.013    0.007 :0(readlines)
        1    0.005    0.005    0.005    0.005 :0(setprofile)
    22663    0.144    0.000    0.144    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
    22663    0.077    0.000    0.077    0.000 :0(translate)
        1    0.003    0.003   13.854   13.854 <string>:1(<module>)
        2    0.895    0.447    1.665    0.833 docdist5.py:105(count_frequency)
        2   11.125    5.562   11.125    5.563 docdist5.py:120(insertion_sort)
        2    0.001    0.000   13.518    6.759 docdist5.py:142(word_frequencies_for_file)
        3    0.179    0.060    0.321    0.107 docdist5.py:160(inner_product)
        1    0.000    0.000    0.322    0.322 docdist5.py:186(vector_angle)
        1    0.011    0.011   13.851   13.851 docdist5.py:196(main)
        2    0.000    0.000    0.014    0.007 docdist5.py:55(read_file)
        2    0.176    0.088    0.713    0.356 docdist5.py:71(get_words_from_line_list)
    22663    0.226    0.000    0.447    0.000 docdist5.py:89(get_words_from_string)
        1    0.000    0.000   13.859   13.859 profile:0(main())
        0    0.000             0.000          profile:0(profiler)
```
Excellent! Now the only "nail left to hit" is insertion_sort, which takes Θ(n²) time in the worst case (and indeed accounts for 11.1 of the 13.9 seconds in the profile above). In order to make this program work well on larger inputs (e.g. the complete works of Shakespeare), we need to replace insertion_sort with something faster; one candidate is sketched below.
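As a preview of what such a replacement could look like, here is a sketch of one standard Θ(n log n) candidate, merge sort. This is an illustration only, not necessarily the route the next version takes; note also that this version returns a new sorted list, so if insertion_sort sorts its list in place, the call site would need a small adjustment:

```python
def merge_sort(A):
    """Return a new list with the elements of A in sorted order.

    Runs in Theta(n log n) time, versus Theta(n^2) worst-case time
    for insertion sort.
    """
    if len(A) <= 1:
        return list(A)
    mid = len(A) // 2
    left = merge_sort(A[:mid])
    right = merge_sort(A[mid:])
    # Merge: repeatedly take the smaller head element of the two
    # sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```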