Document Distance: Program Version 6
To improve the running time of our document distance program, we replace insertion_sort with merge_sort, which runs in Θ(n lg n) time instead of insertion_sort's Θ(n²).
Here is code for merge_sort:
def merge_sort(A):
    """
    Sort list A into order, and return result.
    """
    n = len(A)
    if n==1:
        return A
    mid = n//2     # floor division
    L = merge_sort(A[:mid])
    R = merge_sort(A[mid:])
    return merge(L,R)

def merge(L,R):
    """
    Given two sorted sequences L and R, return their merge.
    """
    i = 0
    j = 0
    answer = []
    while i<len(L) and j<len(R):
        if L[i]<R[j]:
            answer.append(L[i])
            i += 1
        else:
            answer.append(R[j])
            j += 1
    if i<len(L):
        answer.extend(L[i:])
    if j<len(R):
        answer.extend(R[j:])
    return answer
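As a quick sanity check (this snippet is an illustration, not part of docdist6.py), we can call merge_sort on a short, made-up list of (word, count) pairs of the kind produced by count_frequency. Python compares tuples lexicographically, so the result comes back ordered by word:

# Illustration only: exercise merge_sort on a small, made-up input.
# Tuples compare lexicographically, so the output is sorted by word.
pairs = [("the", 12), ("a", 7), ("ran", 2), ("dog", 3)]
print(merge_sort(pairs))
# -> [('a', 7), ('dog', 3), ('ran', 2), ('the', 12)]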
The revised document distance program is now docdist6.py.
Running this on our standard inputs, we obtain:
>docdist6.py t2.bobsey.txt t3.lewis.txt
File t2.bobsey.txt : 6667 lines, 49785 words, 3354 distinct words
File t3.lewis.txt : 15996 lines, 182355 words, 8530 distinct words
The distance between the documents is: 0.574160 (radians)

         885435 function calls (861671 primitive calls) in 6.630 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   135633    0.486    0.000    0.486    0.000 :0(append)
    34545    0.132    0.000    0.132    0.000 :0(extend)
   232140    0.760    0.000    0.760    0.000 :0(has_key)
        2    0.018    0.009    0.018    0.009 :0(items)
   379456    1.327    0.000    1.327    0.000 :0(len)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.014    0.007    0.014    0.007 :0(readlines)
        1    0.004    0.004    0.004    0.004 :0(setprofile)
    22663    0.146    0.000    0.146    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
    22663    0.079    0.000    0.079    0.000 :0(translate)
        1    0.003    0.003    6.626    6.626 <string>:1(<module>)
        2    0.889    0.445    1.667    0.833 docdist6.py:107(count_frequency)
  23766/2    0.335    0.000    3.900    1.950 docdist6.py:122(merge_sort)
    11882    1.849    0.000    3.481    0.000 docdist6.py:134(merge)
        2    0.001    0.001    6.297    3.148 docdist6.py:176(word_frequencies_for_file)
        3    0.177    0.059    0.316    0.105 docdist6.py:194(inner_product)
        1    0.000    0.000    0.316    0.316 docdist6.py:220(vector_angle)
        1    0.011    0.011    6.624    6.624 docdist6.py:230(main)
        2    0.000    0.000    0.014    0.007 docdist6.py:57(read_file)
        2    0.176    0.088    0.714    0.357 docdist6.py:73(get_words_from_line_list)
    22663    0.223    0.000    0.448    0.000 docdist6.py:91(get_words_from_string)
        1    0.000    0.000    6.630    6.630 profile:0(main())
        0    0.000             0.000          profile:0(profiler)
Excellent! We have reduced the overall running time from over three minutes to about 6.6 seconds -- well over an order of magnitude improvement -- and the running time should now scale nearly linearly with the size of the input(s). (Here Θ(n lg n) counts as "nearly linear".)
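A rough back-of-envelope comparison (an illustration, not a measurement) shows where the savings in the sorting step come from. Using the larger distinct-word count from the run above:

# Illustration only: compare n^2 with n*lg(n) for the larger distinct-word count.
import math

n = 8530                          # distinct words in t3.lewis.txt (from the run above)
quadratic    = n * n              # order of insertion_sort's comparisons
linearithmic = n * math.log(n, 2) # order of merge_sort's comparisons

print(quadratic, round(linearithmic), round(quadratic / linearithmic))
# roughly 72,760,900 vs about 111,000 -- a factor of several hundred for the
# sorting step alone; the whole program speeds up by less than that because
# sorting is no longer the dominant cost.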
We can now attempt to run this on our large inputs, such as comparing the complete works of Shakespeare with the complete works of Winston Churchill:
>docdist6.py t5.churchill.txt t8.shakespeare.txt
File t5.churchill.txt : 189685 lines, 1717247 words, 32544 distinct words
File t8.shakespeare.txt : 124456 lines, 929462 words, 23881 distinct words
The distance between the documents is: 0.462095 (radians)

         6926117 function calls (6813271 primitive calls) in 52.886 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(acos)
   763122    2.828    0.000    2.828    0.000 :0(append)
   370564    1.565    0.000    1.565    0.000 :0(extend)
  2646709    8.770    0.000    8.770    0.000 :0(has_key)
        2    0.280    0.140    0.280    0.140 :0(items)
  2034004    7.302    0.000    7.302    0.000 :0(len)
        2    0.001    0.000    0.001    0.000 :0(open)
        2    0.178    0.089    0.178    0.089 :0(readlines)
        1    0.004    0.004    0.004    0.004 :0(setprofile)
   314141    1.862    0.000    1.862    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(sqrt)
   314141    1.108    0.000    1.108    0.000 :0(translate)
        1    0.025    0.025   52.882   52.882 <string>:1(<module>)
        2   10.369    5.184   19.418    9.709 docdist6.py:107(count_frequency)
 112848/2    1.643    0.000   21.717   10.858 docdist6.py:122(merge_sort)
    56423   10.385    0.000   19.662    0.000 docdist6.py:134(merge)
        2    0.011    0.006   51.227   25.614 docdist6.py:176(word_frequencies_for_file)
        3    0.836    0.279    1.491    0.497 docdist6.py:194(inner_product)
        1    0.000    0.000    1.492    1.492 docdist6.py:220(vector_angle)
        1    0.138    0.138   52.856   52.856 docdist6.py:230(main)
        2    0.000    0.000    0.179    0.089 docdist6.py:57(read_file)
        2    2.451    1.225    9.902    4.951 docdist6.py:73(get_words_from_line_list)
   314141    3.131    0.000    6.101    0.000 docdist6.py:91(get_words_from_string)
        1    0.000    0.000   52.886   52.886 profile:0(main())
        0    0.000             0.000          profile:0(profiler)
Very nice! These are large files (many megabytes), and yet we are able to compute the distance between them fairly efficiently (under a minute). If we were to keep looking for efficiency improvements, we would focus on sorting and on the data structures used in count_frequency.
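For example, the has_key calls in the profiles above come from the dictionary lookups inside count_frequency (their count matches the total number of words read). The following is only a sketch of that kind of counting loop, with hypothetical names, not the literal code in docdist6.py:

# Sketch (hypothetical name, not the code in docdist6.py) of a dictionary-based
# counting loop like the one in count_frequency. Each word costs one dictionary
# lookup, so counting is linear in the number of words; only the sort of the
# resulting (word, count) list remains super-linear.
def count_frequency_sketch(word_list):
    """Return a list of (word, frequency) pairs for the given word list."""
    counts = {}
    for word in word_list:
        if word in counts:          # the Python 2 original uses has_key(word)
            counts[word] += 1
        else:
            counts[word] = 1
    return list(counts.items())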
Exercise: Can you eliminate the need for sorting altogether in this program? Re-code the program to do so.