"SfR Fresh" - the SfR Freeware/Shareware Archive 
Member "spambayes-1.0.4/README-DEVEL.txt" of archive spambayes-1.0.4.zip:
1 Copyright (C) 2002 Python Software Foundation; All Rights Reserved
2
3 The Python Software Foundation (PSF) holds copyright on all material
4 in this project. You may use it under the terms of the PSF license;
5 see LICENSE.txt.
6
7
8 Assorted clues.
9
10
11 What's Here?
12 ============
13 Lots of mondo cool partially documented code. What else could there be <wink>?
14
15 The focus of this project so far has not been to produce the fastest or
16 smallest filters, but to set up a flexible pure-Python implementation
17 for doing algorithm research. Lots of people are making fast/small
18 implementations, and it takes an entirely different kind of effort to
19 make genuine algorithm improvements. I think we've done quite well at
20 that so far. The focus of this codebase may change to small/fast
21 later -- as is, the false positive rate has gotten too small to measure
22 reliably across test sets with 4000 hams + 2750 spams, and the f-n rate
23 has also gotten too small to measure reliably across that much training data.
24
25 The code in this project requires Python 2.2 (or later).
26
27 You should definitely check out the FAQ:
28 http://spambayes.org/faq.html
29
30
31 Primary Core Files
32 ==================
33 Options.py
34 Uses ConfigParser to allow fiddling various aspects of the classifier,
35 tokenizer, and test drivers. Create a file named bayescustomize.ini to
36 alter the defaults. Modules wishing to control aspects of their
37 operation merely do
38
39 from Options import options
40
41 near the start, and consult attributes of options. To see what options
42 are available, import Options.py and do
43
44 print Options.options.display_full()
45
46 This will print out a detailed description of each option, the allowed
47 values, and so on. (You can pass in a section or section and option
48 name to display_full if you don't want the whole list).
49
50 As an alternative to bayescustomize.ini, you can set the environment
51 variable BAYESCUSTOMIZE to a list of one or more .ini files; these will
52 be read in, in order, and applied to the options. This allows you to
53 tweak individual runs by combining fragments of .ini files. The
54 character used to separate different .ini files is platform-dependent.
55 On Unix, Linux and Mac OS X systems it is ':'. On Windows it is ';'.
56 On Mac OS 9 and earlier systems it is a NL character.
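
The separator rule above can be sketched in a few lines. This is only an
illustration of the idea, not the code in Options.py: the function name
load_option_files and the section/option names in the usage note are made
up, and it is written for Python 3 (configparser) rather than the Python
2.2 ConfigParser module the project targets. os.pathsep happens to be ':'
on Unix-like systems and ';' on Windows, matching the separators listed
above; the Mac OS 9 newline case is not handled.

```python
import configparser
import os


def load_option_files(env_value, sep=os.pathsep):
    """Read every .ini file named in env_value, in order.

    Hypothetical helper: files later in the list override values
    set by earlier ones, so fragments can be combined per run.
    """
    parser = configparser.ConfigParser()
    # Drop empty entries such as those produced by a trailing separator.
    names = [name for name in env_value.split(sep) if name]
    # read() parses the files in list order, skips files it cannot
    # open, and returns the names it actually parsed.
    loaded = parser.read(names)
    return parser, loaded


if __name__ == "__main__":
    parser, loaded = load_option_files(os.environ.get("BAYESCUSTOMIZE", ""))
    print("read option files:", loaded)
```

For example, with BAYESCUSTOMIZE set to "base.ini:experiment.ini" on Unix,
any option that appears in both files ends up with the value from
experiment.ini.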
57
58 *NOTE* The separator character changed after the second alpha version of
59 the first release. Previously, if multiple files were specified in
60 BAYESCUSTOMIZE they were space-separated.
61
62 classifier.py
63 The classifier, which is the soul of the method.
64
65 tokenizer.py
66 An implementation of tokenize() that Tim can't seem to help but keep
67 working on <wink>. Generates a token stream from a message, which
68 the classifier trains on or predicts against.
69
70 chi2.py
71 A collection of statistics functions.
72
73 IMPORTANT NOTE
74 ==============
75
76 The applications have all been renamed in preparation for 1.0 - the
77 following section refers to old application names.
78
85 Apps
86 ====
87 hammie.py
88 A spamassassin-like filter which uses tokenizer and classifier (above).
89
90 hammiefilter.py
91 A simpler hammie front-end that doesn't print anything. Useful for
92 procmail filtering and scoring from your MUA.
93
94 mboxtrain.py
95 Trainer for Maildir, MH, or mbox mailboxes. Remembers which
96 messages it saw the last time you ran it, and will only train on new
97 messages or messages which should be retrained.
98
99 The idea is to run this automatically every night on your Inbox and
100 Spam folders, and then sort misclassified messages by hand. This
101 will work with any IMAP4 mail client, or any client running on the
102 server.
103
104 pop3proxy.py
105 A spam-classifying POP3 proxy. It adds a spam-judgment header to
106 each mail as it's retrieved, so you can use your email client's
107 filters to deal with them without needing to fiddle with your email
108 delivery system.
109
110 Also acts as a web server providing a user interface that allows you
111 to train the classifier, classify messages interactively, and query
112 the token database. This piece will at some point be split out into
113 a separate module.
114
115 smtpproxy.py
116 A message training SMTP proxy. It sits between your email client and
117 your SMTP server and intercepts mail to set ham and spam addresses.
118 All other mail is simply passed through to the SMTP server.
119
120 mailsort.py
121 A delivery agent that uses a CDB of word probabilities and delivers
122 a message to one of two Maildir message folders, depending on the
123 classifier score. Note that both Maildirs must be on the same
124 device.
125
126 hammiesrv.py
127 A stab at making hammie into a client/server model, using XML-RPC.
128
129 hammiecli.py
130 A client for hammiesrv.
131
132 imapfilter.py
133 A spam-classifying and training application for use with IMAP servers.
134 You can specify folders that contain mail to train as ham/spam, and
135 folders that contain mail to classify, and the filter will do so.
136 Note that this is currently in very early development and not
137 recommended for production use.
138
139
140 Test Driver Core
141 ================
142 Tester.py
143 A test-driver class that feeds streams of msgs to a classifier
144 instance, and keeps track of right/wrong percentages and lists
145 of false positives and false negatives.
146
147 TestDriver.py
148 A flexible higher layer of test helpers, building on Tester above.
149 For example, it's usable for building simple test drivers, NxN test
150 grids, and N-fold cross-validation drivers. See also rates.py,
151 cmp.py, and table.py below.
152
153 msgs.py
154 Some simple classes to wrap raw msgs, and to produce streams of
155 msgs. The test drivers use these.
156
157
158 Concrete Test Drivers
159 =====================
160 mboxtest.py
161 A concrete test driver like timtest.py, but working with a pair of
162 mailbox files rather than the specialized timtest setup.
163
164 timcv.py
165 An N-fold cross-validating test driver. Assumes "a standard" data
166 directory setup (see below) rather than the specialized mboxtest
167 setup.
168 N classifiers are built.
169 1 run is done with each classifier.
170 Each classifier is trained on N-1 sets, and predicts against the sole
171 remaining set (the set not used to train the classifier).
172 mboxtest does the same.
173 This (or mboxtest) is the preferred way to test when possible: it
174 makes best use of limited data, and interpreting results is
175 straightforward.
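
The run structure described above (N classifiers, each trained on N-1 sets
and tested on the one left out) can be sketched as follows. This is not
timcv.py's code; cross_validation_runs is a made-up name, sets are just
numbered 1..N, and the sketch is Python 3.

```python
def cross_validation_runs(n):
    """Yield (training_sets, held_out_set) for each of the n runs."""
    sets = list(range(1, n + 1))
    for held_out in sets:
        # Train on every set except the one being predicted against.
        training = [s for s in sets if s != held_out]
        yield training, held_out


if __name__ == "__main__":
    for training, held_out in cross_validation_runs(5):
        print("train on sets", training, "-> predict against set", held_out)
```

Every set is predicted against exactly once, which is why each message
counts once in the results.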
176
177 timtest.py
178 A concrete test driver like mboxtest.py, but working with "a standard"
179 test data setup (see below). This runs an NxN test grid, skipping
180 the diagonal.
181 N classifiers are built.
182 N-1 runs are done with each classifier.
183 Each classifier is trained on 1 set, and predicts against each of
184 the N-1 remaining sets (those not used to train the classifier).
185 This is a much harder test than timcv, because it trains on N-1 times
186 less data, and makes each classifier predict against N-1 times
187 more data than it's been taught about.
188 It's harder to interpret the results of timtest (than timcv) correctly,
189 because each msg is predicted against N-1 times overall. So, e.g.,
190 one terribly difficult spam or ham can count against you N-1 times.
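
The NxN grid with the diagonal skipped can be written down the same way.
Again this is a sketch rather than timtest.py's code; grid_runs is a
made-up name and the sets are numbered 1..N (Python 3).

```python
def grid_runs(n):
    """Return (train_set, test_set) pairs for an NxN grid, minus the diagonal."""
    sets = range(1, n + 1)
    return [(train, test) for train in sets for test in sets if train != test]


if __name__ == "__main__":
    pairs = grid_runs(5)
    print(len(pairs), "runs in total")
    # Count how often set 1 is predicted against across the whole grid.
    print(sum(1 for _, test in pairs if test == 1), "runs predict against set 1")
```

With N sets there are N*(N-1) runs, and every set shows up as the test set
N-1 times, which is how a single difficult message can be counted against
you N-1 times.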
191
192
193 Test Utilities
194 ==============
195 rates.py
196 Scans the output (so far) produced by TestDriver.Drive(), and captures
197 summary statistics.
198
199 cmp.py
200 Given two summary files produced by rates.py, displays an account
201 of all the f-p and f-n rates side-by-side, along with who won which
202 (etc), the change in total # of unique false positives and negatives,
203 and the change in average f-p and f-n rates.
204
205 table.py
206 Summarizes the high-order bits from any number of summary files,
207 in a compact table.
208
209 fpfn.py
210 Given one or more TestDriver output files, prints a list of false
211 positive and false negative filenames, one per line.
212
213
214 Test Data Utilities
215 ===================
216 cleanarch
217 A script to repair mbox archives by finding "Unix From" lines that
218 should have been escaped, and escaping them.
219
220 unheader.py
221 A script to remove unwanted headers from an mbox file. This is mostly
222 useful to delete headers which incorrectly might bias the results.
223 In default mode, this is similar to 'spamassassin -d', but much, much
224 faster.
225
226 loosecksum.py
227 A script to calculate a "loose" checksum for a message. See the text of
228 the script for an operational definition of "loose".
229
230 rebal.py
231 Evens out the number of messages in "standard" test data folders (see
232 below). Needs generalization (e.g., Ham and 4000 are hardcoded now).
233
234 mboxcount.py
235 Count the number of messages (both parseable and unparseable) in
236 mbox archives.
237
238 split.py
239 splitn.py
240 Split an mbox into random pieces in various ways. Tim recommends
241 using "the standard" test data set up instead (see below).
242
243 splitndirs.py
244 Like splitn.py (above), but splits an mbox into one message per file in
245 "the standard" directory structure (see below). This does an
246 approximate split; rebal.py (above) can be used afterwards to even out
247 the number of messages per folder.
248
249 runtest.sh
250 A Bourne shell script (for Unix) which will run some test or other.
251 I (Neale) will try to keep this updated to test whatever Tim is
252 currently asking for. The idea is, if you have a standard directory
253 structure (below), you can run this thing, go have some tea while it
254 works, then paste the output to the SpamBayes list for good karma.
255
256
257 Standard Test Data Setup
258 ========================
259 Barry gave Tim mboxes, but the spam corpus he got off the web had one spam
260 per file, and it only took two days of extreme pain to realize that one msg
261 per file is enormously easier to work with when testing: you want to split
262 these at random into random collections, you may need to replace some at
263 random when testing reveals spam mistakenly called ham (and vice versa),
264 etc -- even pasting examples into email is much easier when it's one msg
265 per file (and the test drivers make it easy to print a msg's file path).
266
267 The directory structure under my spambayes directory looks like so:
268
269 Data/
270     Spam/
271         Set1/ (contains 1375 spam .txt files)
272         Set2/ ""
273         Set3/ ""
274         Set4/ ""
275         Set5/ ""
276         Set6/ ""
277         Set7/ ""
278         Set8/ ""
279         Set9/ ""
280         Set10/ ""
281         reservoir/ (contains "backup spam")
282     Ham/
283         Set1/ (contains 2000 ham .txt files)
284         Set2/ ""
285         Set3/ ""
286         Set4/ ""
287         Set5/ ""
288         Set6/ ""
289         Set7/ ""
290         Set8/ ""
291         Set9/ ""
292         Set10/ ""
293         reservoir/ (contains "backup ham")
294
295 Every file at the deepest level is used (not just files with .txt
296 extensions). The files don't need to have a "Unix From"
297 header before the RFC-822 message (i.e. a line of the form "From
298 <address> <date>").
299
300 If you use the same names and structure, huge mounds of the tedious testing
301 code will work as-is. The more Set directories the merrier, although you
302 want at least a few hundred messages in each one. The "reservoir"
303 directories contain a few thousand other random hams and spams. When a ham
304 is found that's really spam, move it into a spam directory, then use the
305 rebal.py utility to rebalance the Set directories, moving random message(s)
306 into and/or out of the reservoir directories. The reverse works as well
307 (finding ham in your spam directories).
308
309 The hams are 20,000 msgs selected at random from a python-list archive.
310 The spams are essentially all of Bruce Guenter's 2002 spam archive:
311
312 <http://www.em.ca/~bruceg/spam/>
313
314 The sets are grouped into pairs in the obvious way: Spam/Set1 with
315 Ham/Set1, and so on. For each such pair, timtest trains a classifier on
316 that pair, then runs predictions on each of the other pairs. In effect,
317 it's a NxN test grid, skipping the diagonal. There's no particular reason
318 to avoid predicting against the same set trained on, except that it
319 takes more time and seems the least interesting thing to try.
320
321 Later, support for N-fold cross validation testing was added, which allows
322 more accurate measurement of error rates with smaller amounts of training
323 data. That's recommended now. timcv.py is to cross-validation testing
324 as the older timtest.py is to grid testing. timcv.py has grown additional
325 arguments to allow using only a random subset of messages in each Set.
326
327 CAUTION: The partitioning of your corpora across directories should
328 be random. If it isn't, bias creeps in to the test results. This is
329 usually screamingly obvious under the NxN grid method (rates vary by a
330 factor of 10 or more across training sets, and even within runs against
331 a single training set), but harder to spot using N-fold c-v.
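
One way to get a random partition is to shuffle the message list and deal
it out round-robin. The sketch below only illustrates that idea; it is not
what splitndirs.py or rebal.py do internally, and partition_at_random and
the message names are hypothetical (Python 3).

```python
import random


def partition_at_random(filenames, n_sets, seed=None):
    """Split filenames into n_sets groups chosen at random.

    Group sizes differ by at most one message.
    """
    rng = random.Random(seed)
    shuffled = list(filenames)
    rng.shuffle(shuffled)
    # Deal the shuffled names round-robin: group i gets items i, i+n, i+2n, ...
    return [shuffled[i::n_sets] for i in range(n_sets)]


if __name__ == "__main__":
    names = ["msg%04d.txt" % i for i in range(23)]
    for number, group in enumerate(partition_at_random(names, 5, seed=1), 1):
        print("Set%d gets %d messages" % (number, len(group)))
```

Splitting by date, by mailbox, or by file name order instead is the kind of
non-random partitioning that lets bias creep in.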
332
333 Testing a change and posting the results
334 ========================================
335
336 (Adapted from clues Tim posted on the spambayes and spambayes-dev lists)
337
338 Firstly, set up your data as above; it's really not worth the hassle to
339 come up with a different scheme. If you use the Outlook plug-in, the
340 export.py script in the Outlook2000 directory will export all the spam
341 and ham in your 'training' folders for you into this format (or close
342 enough).
343
344 Basically the idea is that you should have 10 sets of data, each with
345 200 to 500 messages in them. Obviously if you're testing something to
346 do with the size of a corpus, you'll want to change that. You then want
347 to run
348 timcv.py -n 10 > std.txt
349 (call std.txt whatever you like), and then
350 rates.py std.txt
351 You end up with two files: std.txt, which has the raw results, and stds.txt,
352 which has more of a summary of the results.
353
354 Now make the change to the code or options, and repeat the process,
355 giving the files different names (note that rates.py will automatically
356 choose the name for the output file, based on the input one).
357
358 You've now got the data you need, but you have to interpret it. The
359 simplest way of all is just to post it to spambayes-dev@python.org and let
360 someone else do it for you <wink>. The data you should post is the output of
361 cmp.py stds.txt alts.txt
362 along with the output of
363 table.py stds.txt alts.txt
364 (note that these just print to stdout).
365
366 Other information you can find in the 'raw' output (std.txt, above) includes
367 histograms of the ham/spam spread, and a copy of the options settings.
368
369 Interpreting cmp.py output
370 --------------------------
371
372 (Using an example from Tim on spambayes-dev)
373
374 > cv_octs.txt -> cv_oct_subjs.txt
375 > -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams
376 > -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams
377 > -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams
378 > -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams
379 > -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
380 > -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams
381 > -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams
382 > -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams
383 > -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams
384 > -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
385 >
386 > false positive percentages
387 > 0.000 0.000 tied
388 > 0.000 0.000 tied
389 > 0.000 0.000 tied
390 > 0.000 0.000 tied
391 > 0.219 0.219 tied
392 >
393 > won 0 times
394 > tied 5 times
395 > lost 0 times
396
397 So all 5 runs tied on FP. That tells us much more than that the *net*
398 effect across 5 runs was nil on FP: it tells us that there are no hidden
399 glitches hiding behind that "net nothing" -- it was no change across the board.
400
401 > total unique fp went from 1 to 1 tied
402 > mean fp % went from 0.0437636761488 to 0.0437636761488 tied
403 >
404 > false negative percentages
405 > 2.007 2.007 tied
406 > 1.390 1.390 tied
407 > 1.622 1.622 tied
408 > 2.029 1.917 won -5.52%
409 > 2.703 2.477 won -8.36%
410 >
411 > won 2 times
412 > tied 3 times
413 > lost 0 times
414
415 When evaluating a small change, I'm heartened to see that in no run did it lose.
416 At worst it tied, and twice it helped a little. That's encouraging.
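
The won/tied/lost columns and the percentages in the output above can be
reproduced with a little arithmetic. This is a sketch of that arithmetic
only, not cmp.py's code: compare_rates is a made-up name, and reporting
infinity when the "before" rate is zero is an assumption made here (Python
3). Since these are error rates, a lower "after" value counts as a win.

```python
def compare_rates(before, after):
    """Return a (verdict, percent_change) pair for each run."""
    results = []
    for old, new in zip(before, after):
        if new < old:
            verdict = "won"
        elif new > old:
            verdict = "lost"
        else:
            verdict = "tied"
        if new == old:
            change = 0.0
        elif old == 0:
            # Assumption: no meaningful percentage exists from a zero base.
            change = float("inf")
        else:
            change = (new - old) / old * 100.0
        results.append((verdict, change))
    return results


if __name__ == "__main__":
    # The false negative percentages from the example output above.
    before = [2.007, 1.390, 1.622, 2.029, 2.703]
    after = [2.007, 1.390, 1.622, 1.917, 2.477]
    for (verdict, change), old, new in zip(compare_rates(before, after), before, after):
        print("%6.3f %6.3f %-4s %+.2f%%" % (old, new, verdict, change))
```

For the fourth run, (1.917 - 2.029) / 2.029 is about -0.0552, which is the
"won -5.52%" shown in the example.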
417
418 What the histograms would tell us that we can't tell from this is whether you
419 could have done just as well without the change by raising your ham cutoff a little.
420 That would also tie on FP, and *may* also get rid of the same number (or even
421 more) of FN.
422
423 > total unique fn went from 86 to 83 won -3.49%
424 > mean fn % went from 1.95029003772 to 1.88269707836 won -3.47%
425 >
426 > ham mean ham sdev
427 > 0.57 0.58 +1.75% 4.63 4.77 +3.02%
428 > 0.08 0.07 -12.50% 1.20 1.01 -15.83%
429 > 0.36 0.29 -19.44% 3.61 3.23 -10.53%
430 > 0.08 0.11 +37.50% 0.89 1.18 +32.58%
431 > 0.72 0.76 +5.56% 6.80 7.06 +3.82%
432 >
433 > ham mean and sdev for all runs
434 > 0.37 0.37 +0.00% 4.10 4.16 +1.46%
435
436 That's a good example of grand averages hiding the truth: the averaged change
437 in the mean ham score was 0 across all 5 runs, but *within* the 5 runs it slobbered
438 around wildly, from decreasing 20% to increasing 40%(!).
439
440 > spam mean spam sdev
441 > 96.43 96.44 +0.01% 15.89 15.89 +0.00%
442 > 97.01 97.07 +0.06% 13.79 13.70 -0.65%
443 > 97.14 97.16 +0.02% 14.05 14.02 -0.21%
444 > 96.52 96.56 +0.04% 15.65 15.52 -0.83%
445 > 95.53 95.63 +0.10% 17.47 17.31 -0.92%
446 >
447 > spam mean and sdev for all runs
448 > 96.52 96.57 +0.05% 15.46 15.37 -0.58%
449
450 That's good to see: it's a consistent win for spam scores across runs,
451 although an almost imperceptible one. It's good when the mean spam score rises,
452 and it's good when sdev (for ham or spam) decreases.
453
454 > ham/spam mean difference: 96.15 96.20 +0.05
455
456 This is a slight win for the change, although seeing the details gives cause
457 to worry some about the effect on ham: the ham sdev increased overall, and
458 the effects on ham mean and ham sdev varied wildly across runs. OTOH, the
459 "before" numbers for ham mean and ham sdev varied wildly across runs already.
460 That gives cause to worry some about the data <wink>.
461
462
463 Making a source release
464 =======================
465
466 Source releases are built with distutils. Here's how I (Richie) have been
467 building them. I do this on a Windows box, partly so that the zip release
468 can have Windows line endings without needing to run a conversion script.
469 I don't think that's actually necessary, because everything would work on
470 Windows even with Unix line endings, but you couldn't load the files into
471 Notepad and sometimes it's convenient to do so. End users might not even
472 have any other text editor, so it makes things like the README unREADable.
473 8-)
474
475 Anthony would rather eat live worms than try to get a sane environment
476 on Windows, so his approach to building the zip file is at the end.
477
478 o If any new file types have been added since last time (eg. 1.0a5 went
479 out without the Windows .rc and .h files) then add them to MANIFEST.in.
480 If there are any new scripts or packages, add them to setup.py. Test
481 these changes (by building source packages according to the instructions
482 below) then commit your edits.
483 o Check out the 'spambayes' module twice, once with Windows line endings
484 and once with Unix line endings (I use WinCVS for this, using "Admin /
485 Preferences / Globals / Checkout text files with the Unix LF". If you
486 use TortoiseCVS, like Tony, then the option is on the Options tab in
487 the checkout dialog).
488 o Change spambayes/__init__.py to contain the new version number but don't
489 commit it yet, just in case something goes wrong.
490 o In the Windows checkout, run "python setup.py sdist --formats zip"
491 o In the Unix checkout, run "python setup.py sdist --formats gztar"
492 o Take the resulting spambayes-1.0a5.zip and spambayes-1.0a5.tar.gz, and
493 test the former on Windows (ideally in a freshly-installed Python
494 environment; I keep a VMWare snapshot of a clean Windows installation
495 for this, but that's probably overkill 8-) and test the latter on Unix
496 (a Debian VMWare box in my case).
497 o If you can, rename these with "rc" at the end, and make them available
498 to the spambayes-dev crowd as release candidates. If all is OK, then
499 fix the names (or redo this) and keep going.
500 o Dance the SourceForge release dance:
501 http://sourceforge.net/docman/display_doc.php?docid=6445&group_id=1#filereleasesteps
502 When it comes to the "what's new" and the ChangeLog, I cut'n'paste the
503 relevant pieces of WHAT_IS_NEW.txt and CHANGELOG.txt into the form, and
504 check the "Keep my preformatted text" checkbox.
505 o Now commit spambayes/__init__.py and tag the whole checkout - see the
506 existing tag names for the tag name format.
507 o Update the website News, Download, Windows and Application sections.
508 o Update reply.txt in the website repository as needed (it specifies the
509 latest version). Then let Tim, Barry, Tony, or Skip know that they need to
510 update the autoresponder.
511
512 Then announce the release on the mailing lists and watch the bug reports
513 roll in. 8-)
514
515 Anthony's Alternate Approach to Building the Zipfile
516
517 o Unpack the tarball somewhere, making a spambayes-1.0a7 directory
518 (version number will obviously change in future releases)
519 o Run the following two commands:
520
521 find spambayes-1.0a7 -type f -name '*.txt' | xargs zip -l sb107.zip
522 find spambayes-1.0a7 -type f \! -name '*.txt' | xargs zip sb107.zip
523
524 o This makes a zip file in which the .txt files get Windows line endings
525 (zip -l translates LF to CR LF), but everything else is left alone.
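
If the zip command isn't to hand, roughly the same effect can be had from
Python. This is a sketch under assumptions, not part of the release
process: build_zip is a made-up name, it is written for Python 3, and it
does not try to match everything zip does (for instance, file timestamps
and permissions are not preserved). Keep the output file outside the
directory being zipped.

```python
import os
import zipfile


def build_zip(source_dir, zip_path):
    """Zip source_dir, converting LF to CR LF in .txt files only."""
    # Store names relative to the parent, so entries start with the
    # directory name itself (e.g. spambayes-1.0a7/README.txt).
    parent = os.path.dirname(os.path.abspath(source_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(source_dir):
            for name in sorted(files):
                path = os.path.join(folder, name)
                with open(path, "rb") as handle:
                    data = handle.read()
                if name.endswith(".txt"):
                    # Normalise first so existing CR LF pairs are not doubled.
                    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
                arcname = os.path.relpath(path, parent).replace(os.sep, "/")
                archive.writestr(arcname, data)


if __name__ == "__main__":
    build_zip("spambayes-1.0a7", "sb107.zip")
```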
526
527 Making a binary release
528 =======================
529
530 The binary release includes both sb_server and the Outlook plug-in and
531 is an installer for Windows (98 and above) systems. In order to have
532 COM typelibs that work with Outlook 2000, 2002 and 2003, you need to
533 build the installer on a system that has Outlook 2000 (not a more recent
534 version). You also need to have InnoSetup, resourcepackage and py2exe
535 installed.
536
537 o Get hold of a fresh copy of the source (Windows line endings,
538 presumably).
539 o Run sb_server and open the web interface. This gets resourcepackage
540 to generate the needed files.
541 o Replace the __init__.py file in spambayes/spambayes/resources with
542 a blank file to disable resourcepackage.
543 o Ensure that the version numbers in spambayes/spambayes/__init__.py
544 and spambayes/spambayes/Version.py are up-to-date.
545 o Ensure that you don't have any other copies of spambayes in your
546 PYTHONPATH, or py2exe will pick these up! If in doubt, run
547 setup.py install.
548 o Run the "setup_all.py" script in the spambayes/windows/py2exe/
549 directory. This uses py2exe to create the files that Inno will install.
550 o Open (in InnoSetup) the spambayes.iss file in the spambayes/windows/
551 directory. Change the version number in the AppVerName and
552 OutputBaseFilename lines to the new number.
553 o Compile the spambayes.iss script to get the executable.
554 o You can now follow the steps in the source release description above,
555 from the testing step.