"SfR Fresh" - the SfR Freeware/Shareware Archive

Member "spambayes-1.0.4/README-DEVEL.txt" of archive spambayes-1.0.4.zip:

Copyright (C) 2002 Python Software Foundation; All Rights Reserved

The Python Software Foundation (PSF) holds copyright on all material
in this project.  You may use it under the terms of the PSF license;
see LICENSE.txt.


Assorted clues.


What's Here?
============
Lots of mondo cool partially documented code.  What else could there be <wink>?

The focus of this project so far has not been to produce the fastest or
smallest filters, but to set up a flexible pure-Python implementation
for doing algorithm research.  Lots of people are making fast/small
implementations, and it takes an entirely different kind of effort to
make genuine algorithm improvements.  I think we've done quite well at
that so far.  The focus of this codebase may change to small/fast
later -- as is, the false positive rate has gotten too small to measure
reliably across test sets with 4000 hams + 2750 spams, and the f-n rate
has also gotten too small to measure reliably across that much training data.

The code in this project requires Python 2.2 (or later).

You should definitely check out the FAQ:
http://spambayes.org/faq.html


Primary Core Files
==================
Options.py
    Uses ConfigParser to allow fiddling various aspects of the classifier,
    tokenizer, and test drivers.  Create a file named bayescustomize.ini to
    alter the defaults.  Modules wishing to control aspects of their
    operation merely do

        from Options import options

    near the start, and consult attributes of options.  To see what options
    are available, import Options and do

        print Options.options.display_full()

    This will print out a detailed description of each option, the allowed
    values, and so on.  (You can pass in a section, or a section and option
    name, to display_full if you don't want the whole list.)

    As an alternative to bayescustomize.ini, you can set the environment
    variable BAYESCUSTOMIZE to a list of one or more .ini files; these will
    be read in, in order, and applied to the options.  This allows you to
    tweak individual runs by combining fragments of .ini files.  The
    character used to separate different .ini files is platform-dependent.
    On Unix, Linux and Mac OS X systems it is ':'.  On Windows it is ';'.
    On Mac OS 9 and earlier systems it is a NL character.

    *NOTE* The separator character changed after the second alpha version of
    the first release.  Previously, if multiple files were specified in
    BAYESCUSTOMIZE they were space-separated.

classifier.py
    The classifier, which is the soul of the method.

tokenizer.py
    An implementation of tokenize() that Tim can't seem to help but keep
    working on <wink>.  Generates a token stream from a message, which
    the classifier trains on or predicts against.

chi2.py
    A collection of statistics functions.
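
To make the BAYESCUSTOMIZE description under Options.py concrete, here is
a minimal sketch of reading a separator-joined list of .ini files in order,
with later files overriding earlier ones.  It is not the code in Options.py:
it is written for current Python 3 (where the module is spelled
configparser), the function name read_option_files is made up for this
example, and it assumes the ':' and ';' separators quoted above coincide
with Python's os.pathsep.

```python
import configparser
import os


def read_option_files(env_value, pathsep=os.pathsep):
    """Read the .ini files named in env_value, in order.

    Later files override earlier ones, mimicking the "read in, in
    order, and applied" behaviour described for BAYESCUSTOMIZE.
    """
    parser = configparser.ConfigParser()
    # Skip empty entries so a trailing separator is harmless.
    filenames = [name for name in env_value.split(pathsep) if name]
    for name in filenames:
        parser.read(name)
    return parser


if __name__ == "__main__":
    # On Unix, BAYESCUSTOMIZE might look like "base.ini:experiment.ini".
    opts = read_option_files(os.environ.get("BAYESCUSTOMIZE", ""))
    for section in opts.sections():
        print(section, dict(opts[section]))
```

Combining a stable base file with a small per-experiment fragment is the
use the text has in mind; only the options named in the fragment change.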

IMPORTANT NOTE
==============

The applications have all been renamed in preparation for 1.0 - the
following section refers to old application names.

Apps
====
hammie.py
    A spamassassin-like filter which uses tokenizer and classifier (above).

hammiefilter.py
    A simpler hammie front-end that doesn't print anything.  Useful for
    procmail filtering and scoring from your MUA.

mboxtrain.py
    Trainer for Maildir, MH, or mbox mailboxes.  Remembers which
    messages it saw the last time you ran it, and will only train on new
    messages or messages which should be retrained.

    The idea is to run this automatically every night on your Inbox and
    Spam folders, and then sort misclassified messages by hand.  This
    will work with any IMAP4 mail client, or any client running on the
    server.

pop3proxy.py
    A spam-classifying POP3 proxy.  It adds a spam-judgment header to
    each mail as it's retrieved, so you can use your email client's
    filters to deal with them without needing to fiddle with your email
    delivery system.

    Also acts as a web server providing a user interface that allows you
    to train the classifier, classify messages interactively, and query
    the token database.  This piece will at some point be split out into
    a separate module.

smtpproxy.py
    A message training SMTP proxy.  It sits between your email client and
    your SMTP server and intercepts mail sent to the configured ham and
    spam addresses.  All other mail is simply passed through to the SMTP
    server.

mailsort.py
    A delivery agent that uses a CDB of word probabilities and delivers
    a message to one of two Maildir message folders, depending on the
    classifier score.  Note that both Maildirs must be on the same
    device.

hammiesrv.py
    A stab at making hammie into a client/server model, using XML-RPC.

hammiecli.py
    A client for hammiesrv.

imapfilter.py
    A spam-classifying and training application for use with IMAP servers.
    You can specify folders that contain mail to train as ham/spam, and
    folders that contain mail to classify, and the filter will do so.
    Note that this is currently in very early development and not
    recommended for production use.
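
The mailsort.py entry requires both Maildirs to be on the same device.
One way to check that before pointing a delivery agent at two folders is
to compare the device numbers the OS reports.  This is only a sketch in
current Python 3; same_device is a name invented here and is not part of
mailsort.py.

```python
import os


def same_device(path_a, path_b):
    """Return True if both existing paths live on the same device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    import sys

    # Usage: python same_device.py ~/Maildir ~/Maildir-spam
    ham_dir, spam_dir = sys.argv[1:3]
    print("same device" if same_device(ham_dir, spam_dir) else "DIFFERENT devices")
```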


Test Driver Core
================
Tester.py
    A test-driver class that feeds streams of msgs to a classifier
    instance, and keeps track of right/wrong percentages and lists
    of false positives and false negatives.

TestDriver.py
    A flexible higher layer of test helpers, building on Tester above.
    For example, it's usable for building simple test drivers, NxN test
    grids, and N-fold cross-validation drivers.  See also rates.py,
    cmp.py, and table.py below.

msgs.py
    Some simple classes to wrap raw msgs, and to produce streams of
    msgs.  The test drivers use these.
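
The bookkeeping described for Tester.py (right/wrong percentages plus
lists of false positives and false negatives) can be pictured with a toy
class like the one below.  This is an illustration in current Python 3,
not the Tester class itself; the class name Tally and its methods are
invented for the example.

```python
class Tally:
    """Track counts plus lists of false positives and false negatives."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.false_pos = []  # hams that were predicted to be spam
        self.false_neg = []  # spams that were predicted to be ham

    def record(self, msg_id, is_spam, predicted_spam):
        """Record one prediction against the known answer."""
        if is_spam:
            self.nspam += 1
            if not predicted_spam:
                self.false_neg.append(msg_id)
        else:
            self.nham += 1
            if predicted_spam:
                self.false_pos.append(msg_id)

    def fp_rate(self):
        """False positives as a percentage of hams tested."""
        return 100.0 * len(self.false_pos) / self.nham if self.nham else 0.0

    def fn_rate(self):
        """False negatives as a percentage of spams tested."""
        return 100.0 * len(self.false_neg) / self.nspam if self.nspam else 0.0
```

Keeping the message identifiers, not just the counts, is what makes it
easy to go back and look at the messages that were misjudged.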


Concrete Test Drivers
=====================
mboxtest.py
    A concrete test driver like timtest.py, but working with a pair of
    mailbox files rather than the specialized timtest setup.

timcv.py
    An N-fold cross-validating test driver.  Assumes "a standard" data
        directory setup (see below) rather than the specialized mboxtest
        setup.
    N classifiers are built.
    1 run is done with each classifier.
    Each classifier is trained on N-1 sets, and predicts against the sole
        remaining set (the set not used to train the classifier).
    mboxtest does the same.
    This (or mboxtest) is the preferred way to test when possible:  it
        makes best use of limited data, and interpreting results is
        straightforward.

timtest.py
    A concrete test driver like mboxtest.py, but working with "a standard"
        test data setup (see below).  This runs an NxN test grid, skipping
        the diagonal.
    N classifiers are built.
    N-1 runs are done with each classifier.
    Each classifier is trained on 1 set, and predicts against each of
        the N-1 remaining sets (those not used to train the classifier).
    This is a much harder test than timcv, because it trains on N-1 times
        less data, and makes each classifier predict against N-1 times
        more data than it's been taught about.
    It's harder to interpret the results of timtest (than timcv) correctly,
        because each msg is predicted against N-1 times overall.  So, e.g.,
        one terribly difficult spam or ham can count against you N-1 times.
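
The difference between the two schemes is easiest to see by listing which
sets each classifier trains on and predicts against.  The sketch below
(current Python 3; cv_plan and grid_plan are names made up here, not
functions from timcv.py or timtest.py) only builds those lists and does
no training.

```python
def cv_plan(n):
    """N-fold cross-validation: train on N-1 sets, test on the one left out."""
    sets = list(range(1, n + 1))
    return [([s for s in sets if s != held_out], [held_out])
            for held_out in sets]


def grid_plan(n):
    """NxN grid minus the diagonal: train on 1 set, test on each other set."""
    sets = list(range(1, n + 1))
    return [([train], [test])
            for train in sets for test in sets if test != train]


if __name__ == "__main__":
    for train, test in cv_plan(3):
        print("c-v  train on", train, "predict against", test)
    for train, test in grid_plan(3):
        print("grid train on", train, "predict against", test)
```

With N sets, cv_plan gives N runs and each set is predicted against once;
grid_plan gives N*(N-1) runs and each set is predicted against N-1 times,
which is why one hard message can count against you N-1 times there.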


Test Utilities
==============
rates.py
    Scans the output (so far) produced by TestDriver.Drive(), and captures
    summary statistics.

cmp.py
    Given two summary files produced by rates.py, displays an account
    of all the f-p and f-n rates side-by-side, along with who won which
    (etc), the change in total # of unique false positives and negatives,
    and the change in average f-p and f-n rates.

table.py
    Summarizes the high-order bits from any number of summary files,
    in a compact table.

fpfn.py
    Given one or more TestDriver output files, prints list of false
    positive and false negative filenames, one per line.


Test Data Utilities
===================
cleanarch
    A script to repair mbox archives by finding "Unix From" lines that
    should have been escaped, and escaping them.

unheader.py
    A script to remove unwanted headers from an mbox file.  This is mostly
    useful to delete headers which incorrectly might bias the results.
    In default mode, this is similar to 'spamassassin -d', but much, much
    faster.

loosecksum.py
    A script to calculate a "loose" checksum for a message.  See the text of
    the script for an operational definition of "loose".

rebal.py
    Evens out the number of messages in "standard" test data folders (see
    below).  Needs generalization (e.g., Ham and 4000 are hardcoded now).

mboxcount.py
    Count the number of messages (both parseable and unparseable) in
    mbox archives.

split.py
splitn.py
    Split an mbox into random pieces in various ways.  Tim recommends
    using "the standard" test data set up instead (see below).

splitndirs.py
    Like splitn.py (above), but splits an mbox into one message per file in
    "the standard" directory structure (see below).  This does an
    approximate split; rebal.py (above) can be used afterwards to even out
    the number of messages per folder.

runtest.sh
    A Bourne shell script (for Unix) which will run some test or other.
    I (Neale) will try to keep this updated to test whatever Tim is
    currently asking for.  The idea is, if you have a standard directory
    structure (below), you can run this thing, go have some tea while it
    works, then paste the output to the SpamBayes list for good karma.


Standard Test Data Setup
========================
Barry gave Tim mboxes, but the spam corpus he got off the web had one spam
per file, and it only took two days of extreme pain to realize that one msg
per file is enormously easier to work with when testing:  you want to split
these at random into random collections, you may need to replace some at
random when testing reveals spam mistakenly called ham (and vice versa),
etc -- even pasting examples into email is much easier when it's one msg
per file (and the test drivers make it easy to print a msg's file path).

The directory structure under my spambayes directory looks like so:

Data/
    Spam/
        Set1/ (contains 1375 spam .txt files)
        Set2/            ""
        Set3/            ""
        Set4/            ""
        Set5/            ""
        Set6/            ""
        Set7/            ""
        Set8/            ""
        Set9/            ""
        Set10/           ""
        reservoir/ (contains "backup spam")
    Ham/
        Set1/ (contains 2000 ham .txt files)
        Set2/            ""
        Set3/            ""
        Set4/            ""
        Set5/            ""
        Set6/            ""
        Set7/            ""
        Set8/            ""
        Set9/            ""
        Set10/           ""
        reservoir/ (contains "backup ham")

Every file at the deepest level is used (not just files with .txt
extensions).  The files don't need to have a "Unix From"
header before the RFC-822 message (i.e. a line of the form "From
<address> <date>").
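
For anyone scripting against this layout, the sketch below walks it the
way the text describes: Spam/SetN paired with Ham/SetN, and every file in
a Set directory counted regardless of extension.  It is written for
current Python 3 and is not taken from the test drivers; set_pairs and
message_files are names made up for this example.

```python
import os


def set_pairs(root, nsets):
    """Yield (spam_dir, ham_dir) for Set1..SetN under root (e.g. "Data")."""
    for i in range(1, nsets + 1):
        name = "Set%d" % i
        yield (os.path.join(root, "Spam", name),
               os.path.join(root, "Ham", name))


def message_files(set_dir):
    """Return every regular file in set_dir, sorted, whatever its extension."""
    return sorted(
        os.path.join(set_dir, name)
        for name in os.listdir(set_dir)
        if os.path.isfile(os.path.join(set_dir, name))
    )


if __name__ == "__main__":
    for spam_dir, ham_dir in set_pairs("Data", 10):
        print(spam_dir, len(message_files(spam_dir)),
              ham_dir, len(message_files(ham_dir)))
```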

If you use the same names and structure, huge mounds of the tedious testing
code will work as-is.  The more Set directories the merrier, although you
want at least a few hundred messages in each one.  The "reservoir"
directories contain a few thousand other random hams and spams.  When a ham
is found that's really spam, move it into a spam directory, then use the
rebal.py utility to rebalance the Set directories, moving random message(s)
into and/or out of the reservoir directories.  The reverse works as well
(finding ham in your spam directories).
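
The rebalancing idea can be sketched on plain lists of message names
instead of real directories.  This is not rebal.py's actual algorithm,
just one straightforward way to do what the paragraph describes, in
current Python 3; the function name and its arguments are invented here.

```python
import random


def rebalance(sets, reservoir, target, rng=random):
    """Even out the sets, using the reservoir as overflow and supply.

    sets is a list of lists of message names and reservoir is a list;
    both are modified in place.  Afterwards each set holds `target`
    messages, or fewer if the reservoir ran dry.
    """
    # First pass: push random surplus messages out to the reservoir.
    for msgs in sets:
        while len(msgs) > target:
            reservoir.append(msgs.pop(rng.randrange(len(msgs))))
    # Second pass: pull random messages in to top up short sets.
    for msgs in sets:
        while len(msgs) < target and reservoir:
            msgs.append(reservoir.pop(rng.randrange(len(reservoir))))
```

Trimming every set before filling any of them means a surplus in a later
set can still cover a shortfall in an earlier one.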

The hams are 20,000 msgs selected at random from a python-list archive.
The spams are essentially all of Bruce Guenter's 2002 spam archive:

    <http://www.em.ca/~bruceg/spam/>

The sets are grouped into pairs in the obvious way:  Spam/Set1 with
Ham/Set1, and so on.  For each such pair, timtest trains a classifier on
that pair, then runs predictions on each of the other pairs.  In effect,
it's an NxN test grid, skipping the diagonal.  There's no particular reason
to avoid predicting against the same set trained on, except that it
takes more time and seems the least interesting thing to try.

Later, support for N-fold cross validation testing was added, which allows
more accurate measurement of error rates with smaller amounts of training
data.  That's recommended now.  timcv.py is to cross-validation testing
as the older timtest.py is to grid testing.  timcv.py has grown additional
arguments to allow using only a random subset of messages in each Set.

CAUTION:  The partitioning of your corpora across directories should
be random.  If it isn't, bias creeps into the test results.  This is
usually screamingly obvious under the NxN grid method (rates vary by a
factor of 10 or more across training sets, and even within runs against
a single training set), but harder to spot using N-fold c-v.
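
A simple way to get a random partition is to shuffle the whole corpus
once and then deal the messages out round-robin.  The sketch below shows
that idea in current Python 3; it is an illustration and not what
splitndirs.py does internally, and random_partition is a made-up name.

```python
import random


def random_partition(msgs, nsets, seed=None):
    """Shuffle msgs, then deal them round-robin into nsets lists."""
    shuffled = list(msgs)
    random.Random(seed).shuffle(shuffled)
    # Slice i takes every nsets-th message starting at position i, so
    # the resulting set sizes differ by at most one.
    return [shuffled[i::nsets] for i in range(nsets)]
```

Splitting by date or by source mailbox instead is the kind of non-random
partitioning the caution above warns about.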

Testing a change and posting the results
========================================

(Adapted from clues Tim posted on the spambayes and spambayes-dev lists)

Firstly, set up your data as above; it's really not worth the hassle to
come up with a different scheme.  If you use the Outlook plug-in, the
export.py script in the Outlook2000 directory will export all the spam
and ham in your 'training' folders for you into this format (or close
enough).

Basically the idea is that you should have 10 sets of data, each with
200 to 500 messages in it.  Obviously if you're testing something to
do with the size of a corpus, you'll want to change that.  You then want
to run
    timcv.py -n 10 > std.txt
(call std.txt whatever you like), and then
    rates.py std.txt
You end up with two files, std.txt, which has the raw results, and stds.txt,
which has more of a summary of the results.

Now make the change to the code or options, and repeat the process,
giving the files different names (note that rates.py will automatically
choose the name for the output file, based on the input one).

You've now got the data you need, but you have to interpret it.  The
simplest way of all is just to post it to spambayes-dev@python.org and let
someone else do it for you <wink>.  The data you should post is the output of
    cmp.py stds.txt alts.txt
along with the output of
    table.py stds.txt alts.txt
(note that these just print to stdout).
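
To keep the file names straight across a before/after comparison, the
naming pattern in the examples above can be written down as a small
helper.  The rule used here (drop ".txt", append "s.txt") is only
inferred from the std.txt -> stds.txt example, not read out of rates.py,
and both function names are invented.  The sketch is current Python 3 and
just builds the command lines; it does not run them.

```python
def summary_name(raw_name):
    """Guess the rates.py output name, e.g. 'std.txt' -> 'stds.txt'."""
    base = raw_name[:-4] if raw_name.lower().endswith(".txt") else raw_name
    return base + "s.txt"


def comparison_commands(before_raw, after_raw):
    """Return the command lines for summarising and comparing two runs."""
    before = summary_name(before_raw)
    after = summary_name(after_raw)
    return [
        ["rates.py", before_raw],
        ["rates.py", after_raw],
        ["cmp.py", before, after],
        ["table.py", before, after],
    ]


if __name__ == "__main__":
    for cmd in comparison_commands("std.txt", "alt.txt"):
        print(" ".join(cmd))
```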

Other information you can find in the 'raw' output (std.txt, above) includes
histograms of the ham/spam spread and a copy of the options settings.

Interpreting cmp.py output
--------------------------

(Using an example from Tim on spambayes-dev)

> cv_octs.txt -> cv_oct_subjs.txt
> -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams
> -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams
> -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams
> -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams
> -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
> -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams
> -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams
> -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams
> -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams
> -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
>
> false positive percentages
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.219  0.219  tied
>
> won   0 times
> tied  5 times
> lost  0 times

So all 5 runs tied on FP.  That tells us much more than that the *net*
effect across 5 runs was nil on FP:  it tells us that there are no
glitches hiding behind that "net nothing" -- it was no change across the board.

> total unique fp went from 1 to 1 tied
> mean fp % went from 0.0437636761488 to 0.0437636761488 tied
>
> false negative percentages
>     2.007  2.007  tied
>     1.390  1.390  tied
>     1.622  1.622  tied
>     2.029  1.917  won     -5.52%
>     2.703  2.477  won     -8.36%
>
> won   2 times
> tied  3 times
> lost  0 times

When evaluating a small change, I'm heartened to see that in no run did it lose.
At worst it tied, and twice it helped a little.  That's encouraging.
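
The won/tied/lost tallies and the percentages in the right-hand column
can be reproduced from the two columns of rates.  The sketch below does
that for the false negative figures quoted above.  It is written for
current Python 3 and is not cmp.py's code; it assumes the percentage is
the change relative to the "before" value, which matches the numbers
shown.

```python
def pct_change(before, after):
    """Relative change from before to after, as a percentage."""
    return 100.0 * (after - before) / before


def tally(before_rates, after_rates):
    """Count runs won, tied and lost; a lower error rate wins."""
    won = tied = lost = 0
    for before, after in zip(before_rates, after_rates):
        if after < before:
            won += 1
        elif after > before:
            lost += 1
        else:
            tied += 1
    return won, tied, lost


fn_before = [2.007, 1.390, 1.622, 2.029, 2.703]
fn_after = [2.007, 1.390, 1.622, 1.917, 2.477]

print(tally(fn_before, fn_after))            # (2, 3, 0)
print("%+.2f%%" % pct_change(2.029, 1.917))  # -5.52%
print("%+.2f%%" % pct_change(2.703, 2.477))  # -8.36%
```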

What the histograms would tell us that we can't tell from this is whether you
could have done just as well without the change by raising your ham cutoff a little.
That would also tie on FP, and *may* also get rid of the same number (or even
more) of FN.

> total unique fn went from 86 to 83 won     -3.49%
> mean fn % went from 1.95029003772 to 1.88269707836 won     -3.47%
>
> ham mean                     ham sdev
>    0.57    0.58   +1.75%        4.63    4.77   +3.02%
>    0.08    0.07  -12.50%        1.20    1.01  -15.83%
>    0.36    0.29  -19.44%        3.61    3.23  -10.53%
>    0.08    0.11  +37.50%        0.89    1.18  +32.58%
>    0.72    0.76   +5.56%        6.80    7.06   +3.82%
>
> ham mean and sdev for all runs
>    0.37    0.37   +0.00%        4.10    4.16   +1.46%

That's a good example of grand averages hiding the truth:  the averaged change
in the mean ham score was 0 across all 5 runs, but *within* the 5 runs it slobbered
around wildly, from decreasing 20% to increasing 40%(!).
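
The point about grand averages can be checked by hand with the five ham
mean pairs from the table.  The sketch below (current Python 3) averages
the per-run means without weighting, which is only a stand-in for the
"all runs" line; that line is presumably computed over all messages, so
its 0.37 need not equal the simple average printed here.

```python
ham_mean_before = [0.57, 0.08, 0.36, 0.08, 0.72]
ham_mean_after = [0.58, 0.07, 0.29, 0.11, 0.76]


def mean(values):
    """Plain unweighted arithmetic mean."""
    return sum(values) / len(values)


def pct_changes(before, after):
    """Per-run relative change, in percent."""
    return [100.0 * (a - b) / b for b, a in zip(before, after)]


changes = pct_changes(ham_mean_before, ham_mean_after)
print("average of per-run means: %.3f -> %.3f"
      % (mean(ham_mean_before), mean(ham_mean_after)))
print("per-run change ranges from %+.2f%% to %+.2f%%"
      % (min(changes), max(changes)))
```

Both averages come out the same even though the individual runs move by
roughly -19% to +38%, which is exactly the effect the comment describes.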

> spam mean                    spam sdev
>   96.43   96.44   +0.01%       15.89   15.89   +0.00%
>   97.01   97.07   +0.06%       13.79   13.70   -0.65%
>   97.14   97.16   +0.02%       14.05   14.02   -0.21%
>   96.52   96.56   +0.04%       15.65   15.52   -0.83%
>   95.53   95.63   +0.10%       17.47   17.31   -0.92%
>
> spam mean and sdev for all runs
>   96.52   96.57   +0.05%       15.46   15.37   -0.58%

That's good to see:  it's a consistent win for spam scores across runs,
although an almost imperceptible one.  It's good when the mean spam score rises,
and it's good when sdev (for ham or spam) decreases.

> ham/spam mean difference: 96.15 96.20 +0.05

This is a slight win for the change, although seeing the details gives cause
to worry some about the effect on ham:  the ham sdev increased overall, and
the effects on ham mean and ham sdev varied wildly across runs.  OTOH, the
"before" numbers for ham mean and ham sdev varied wildly across runs already.
That gives cause to worry some about the data <wink>.


Making a source release
=======================

Source releases are built with distutils.  Here's how I (Richie) have been
building them.  I do this on a Windows box, partly so that the zip release
can have Windows line endings without needing to run a conversion script.
I don't think that's actually necessary, because everything would work on
Windows even with Unix line endings, but you couldn't load the files into
Notepad and sometimes it's convenient to do so.  End users might not even
have any other text editor, so Unix line endings would make things like
the README unREADable.  8-)

Anthony would rather eat live worms than try to get a sane environment
on Windows, so his approach to building the zip file is at the end.

 o If any new file types have been added since last time (eg. 1.0a5 went
   out without the Windows .rc and .h files) then add them to MANIFEST.in.
   If there are any new scripts or packages, add them to setup.py.  Test
   these changes (by building source packages according to the instructions
   below) then commit your edits.
 o Check out the 'spambayes' module twice, once with Windows line endings
   and once with Unix line endings (I use WinCVS for this, using "Admin /
   Preferences / Globals / Checkout text files with the Unix LF".  If you
   use TortoiseCVS, like Tony, then the option is on the Options tab in
   the checkout dialog).
 o Change spambayes/__init__.py to contain the new version number but don't
   commit it yet, just in case something goes wrong.
 o In the Windows checkout, run "python setup.py sdist --formats zip"
 o In the Unix checkout, run "python setup.py sdist --formats gztar"
 o Take the resulting spambayes-1.0a5.zip and spambayes-1.0a5.tar.gz, and
   test the former on Windows (ideally in a freshly-installed Python
   environment; I keep a VMWare snapshot of a clean Windows installation
   for this, but that's probably overkill 8-) and test the latter on Unix
   (a Debian VMWare box in my case).
 o If you can, rename these with "rc" at the end, and make them available
   to the spambayes-dev crowd as release candidates.  If all is OK, then
   fix the names (or redo this) and keep going.
 o Dance the SourceForge release dance:
   http://sourceforge.net/docman/display_doc.php?docid=6445&group_id=1#filereleasesteps
   When it comes to the "what's new" and the ChangeLog, I cut'n'paste the
   relevant pieces of WHAT_IS_NEW.txt and CHANGELOG.txt into the form, and
   check the "Keep my preformatted text" checkbox.
 o Now commit spambayes/__init__.py and tag the whole checkout - see the
   existing tag names for the tag name format.
 o Update the website News, Download, Windows and Application sections.
 o Update reply.txt in the website repository as needed (it specifies the
   latest version).  Then let Tim, Barry, Tony, or Skip know that they need to
   update the autoresponder.

Then announce the release on the mailing lists and watch the bug reports
roll in.  8-)

Anthony's Alternate Approach to Building the Zipfile

 o Unpack the tarball somewhere, making a spambayes-1.0a7 directory
   (version number will obviously change in future releases)
 o Run the following two commands:

     find spambayes-1.0a7 -type f -name '*.txt' | xargs zip -l sb107.zip
     find spambayes-1.0a7 -type f \! -name '*.txt' | xargs zip sb107.zip

 o This makes a zip file in which the .txt files have had their line
   endings converted (zip's -l option translates LF to CR LF), but
   everything else is left alone.
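
For anyone without the zip command to hand, roughly the same result can
be had from Python's zipfile module.  The sketch below is one possible
equivalent in current Python 3, not part of the release process described
here, and zip_with_crlf_text is a made-up name.  Unlike "zip -l" it first
normalises any existing CR LF so endings are never doubled, and it does
not preserve file permissions.

```python
import os
import zipfile


def zip_with_crlf_text(src_dir, zip_path):
    """Zip src_dir, converting LF to CR LF in .txt files only."""
    parent = os.path.dirname(os.path.abspath(src_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Store names relative to src_dir's parent, so entries
                # look like "spambayes-1.0a7/README.txt".
                arcname = os.path.relpath(os.path.abspath(path), parent)
                with open(path, "rb") as f:
                    data = f.read()
                if name.endswith(".txt"):
                    # Normalise first so existing CR LF is not doubled.
                    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
                zf.writestr(arcname, data)


if __name__ == "__main__":
    # Write the zip outside the source tree so it does not zip itself.
    zip_with_crlf_text("spambayes-1.0a7", "sb107.zip")
```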

Making a binary release
=======================

The binary release includes both sb_server and the Outlook plug-in and
is an installer for Windows (98 and above) systems.  In order to have
COM typelibs that work with Outlook 2000, 2002 and 2003, you need to
build the installer on a system that has Outlook 2000 (not a more recent
version).  You also need to have InnoSetup, resourcepackage and py2exe
installed.

 o Get hold of a fresh copy of the source (Windows line endings,
   presumably).
 o Run sb_server and open the web interface.  This gets resourcepackage
   to generate the needed files.
 o Replace the __init__.py file in spambayes/spambayes/resources with
   a blank file to disable resourcepackage.
 o Ensure that the version numbers in spambayes/spambayes/__init__.py
   and spambayes/spambayes/Version.py are up-to-date.
 o Ensure that you don't have any other copies of spambayes in your
   PYTHONPATH, or py2exe will pick these up!  If in doubt, run
   setup.py install.
 o Run the "setup_all.py" script in the spambayes/windows/py2exe/
   directory. This uses py2exe to create the files that Inno will install.
 o Open (in InnoSetup) the spambayes.iss file in the spambayes/windows/
   directory.  Change the version number in the AppVerName and
   OutputBaseFilename lines to the new number.
 o Compile the spambayes.iss script to get the executable.
 o You can now follow the steps in the source release description above,
   from the testing step.