2007-april-30

* bufio has seen a MAJOR overhaul. It is now capable of pushing text & binary
  data to the file system at unprecedented rates. This is done by adding a
  variable-sized (and possibly large) memory cache, resulting in large-size
  I/O operations. These perform very much faster than the regular C RTL I/O
  calls (tested on quad CPU UNIX Dell servers). The new bufio was required as
  I needed to log/track a huge amount of data in the shortest possible time /
  at the lowest possible CPU load.
* cookie handling has been fixed/augmented. pavuk can now have the initial
  cookie values that go with a certain web request preconfigured on the
  commandline. Also, several bugs in handling the cookies have been fixed.
  (Tested on a wicked ASP.NET intranet site which 'assumed' the use of a
  special web client (a TV set top box) which would transmit its serial # as
  a client-side created(!) cookie to the web server. This site/client combo
  thus actually transmitted cookies which would first show up in a web
  _request_ instead of the usual: a server-side _response_.)
* several portability items have been changed (h_errno, ...) to make the code
  compile and work on the odd-flavored UNIX box. A native Win32 port is under
  way: it now works, including zlib and OpenSSL, though the latter has not
  been tested recently. Note that the changes may have broken GTK support, as
  I was not able to build the code with GTK on my UNIX boxes.
* socket I/O (IP traffic) has been fixed to properly cope with user breaks
  (a user hitting Ctrl+C). Several locations in the software where the
  unexpected signal would cause an infinite loop have been identified and
  fixed.
* added several lines of DEBUG_xxx to aid both developer and user in tracking
  down hard-to-diagnose issues inside pavuk while scanning a site.
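
The idea behind the bufio memory cache (collect many small writes in memory, then hand them to the OS as one large I/O operation) can be sketched as follows. This is a minimal illustration only: wcache, sink_fn, sink_stats and count_sink are hypothetical names, not pavuk's actual bufio API, and the sink here merely counts calls where real code would typically call write().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; this is NOT pavuk's bufio API. */
typedef int (*sink_fn)(void *ctx, const void *data, size_t len);

typedef struct {
    sink_fn sink;      /* receives one large block per flush */
    void   *ctx;       /* opaque argument passed to the sink */
    char   *cache;     /* in-memory cache */
    size_t  capacity;  /* cache size in bytes */
    size_t  used;      /* bytes currently cached */
} wcache;

/* Example sink: only counts calls and bytes (stand-in for write()). */
typedef struct { size_t calls; size_t bytes; } sink_stats;

int count_sink(void *ctx, const void *data, size_t len)
{
    sink_stats *st = ctx;
    (void)data;
    st->calls++;
    st->bytes += len;
    return 0;
}

int wcache_init(wcache *w, sink_fn sink, void *ctx, size_t capacity)
{
    assert(w != NULL && sink != NULL && capacity > 0);
    w->cache = malloc(capacity);
    if (w->cache == NULL)
        return -1;
    w->sink = sink;
    w->ctx = ctx;
    w->capacity = capacity;
    w->used = 0;
    return 0;
}

/* Push everything cached so far to the sink in a single call. */
int wcache_flush(wcache *w)
{
    if (w->used > 0) {
        if (w->sink(w->ctx, w->cache, w->used) != 0)
            return -1;
        w->used = 0;
    }
    return 0;
}

/* Copy data into the cache; flush only when the cache is full. */
int wcache_write(wcache *w, const void *data, size_t len)
{
    const char *p = data;
    while (len > 0) {
        size_t room = w->capacity - w->used;
        size_t n = len < room ? len : room;
        memcpy(w->cache + w->used, p, n);
        w->used += n;
        p += n;
        len -= n;
        if (w->used == w->capacity && wcache_flush(w) != 0)
            return -1;
    }
    return 0;
}

/* Flush the remainder and release the cache. */
int wcache_close(wcache *w)
{
    int rc = wcache_flush(w);
    free(w->cache);
    w->cache = NULL;
    return rc;
}
```

With a cache of a few hundred kilobytes, thousands of small log writes collapse into a handful of sink calls, which is where the reduction in CPU load would come from.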
* Accept-Encoding (more specifically: the handling of
  x-gzip/gzip/x-compress/compress encoding) has been changed to allow for
  better portability: data is expanded in-memory, without the need for an
  external 'gzip' tool and/or OS-specific forks & pipes. (Win32 wouldn't know
  a fork if ever it saw one.)
* ALL stdio is now handled through the new bufio system. This not only
  improves performance when you've got -debug and -debug_level dialed all the
  way up, but also corrected several spots where, depending on your C RTL,
  stdout/stderr traffic would arrive at different moments on your console
  (some of it was written through the FILE I/O, some through direct I/O,
  causing blurbs of output to pass one another along the way to the actual
  console).
* buffer overrun protection has been improved. Note also that every
  snprintf() and derivative thereof is now 'augmented' by an additional line
  of code which ensures that the last character in the buffer is guaranteed
  to be a NUL sentinel, thus ensuring that the buffer will always present
  data in correct C string format (NUL-terminated). (This is an old habit of
  mine, as some C RTLs have proven to be kinda flaky on the subject of NUL
  sentinels when snprintf() et al are writing data up to the edge of their
  output buffers: some C RTLs 'forget' to put a NUL there under particular
  circumstances (some commercial Watcom compiler releases come to mind).)
* multithreading: pavuk has been tested on a high perf MP UNIX box and it was
  like the documentation/notes state somewhere: unstable. The thread
  interlocking has now been fixed; one of the hardest issues to fix proved to
  be the lockup at the end of a pavuk run. The fix also includes the use of
  semaphores and some additional code changes to make the code thread safe;
  critical sections are now handled as such. This includes placing several
  non-threadsafe C RTL calls (e.g. ctime()) inside critical sections!
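
The NUL sentinel guard described for snprintf() can be sketched like this. The wrapper name snprintf_nul is hypothetical and used for illustration only; the entry above says the guard is an additional line of code next to each call, not that it lives in a wrapper.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper name; it only illustrates the guard line. */
int snprintf_nul(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && size > 0);
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);

    /* The additional guard line: even if the C RTL 'forgot' the
     * terminator on truncation, the buffer is a valid C string. */
    buf[size - 1] = '\0';
    return n;
}
```

On a conforming C RTL the guard line is redundant, which is exactly why it is cheap insurance: it changes nothing there and repairs the result on a flaky one.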
* auto-form-filling (the feature which led me to select pavuk over wget et al
  when I started the hammer/chunky project) has been fixed for those special
  pages where you have an empty form to submit: the site I had to test
  included such a form, which was submitted using javascript, but did not
  contain _any_ input fields (but cookies were expected to come with that
  request, thank you). Before, pavuk crashed on such a page. This has now
  been fixed.
* added a 'reindent' target to the makefile, using GNU indent to reformat the
  code. (When you're working several weeks on end in crunch time, you want to
  see some proper and consistent looking source code, even when you just made
  it a mess yourself...) Also extended the cleanup makefile target to help me
  in cleaning up any backup and/or temporary files created by vi and some log
  diagnostic scripts.
* added several commandline parameter types, which allow you to instruct
  pavuk to use OS file handles or file names for logging activity. You can
  now also specify whether a log file should be overwritten (default) or
  appended to (new feature) by adding another '@' prefix to the file path.
  TODO: document this properly.
* added hammer/crunchy modes: several ways to scan a web site and then rescan
  it. The higher (later) hammer mode has been specifically written to use
  pavuk as a 'replay attack' based DoS tool for testing high performance web
  servers. (bufio was overhauled to allow us to log all I/O data +
  diagnostics to disc while hammering the server, while the pavuk system
  _must_ perform better (= faster) than the web server when running both on
  equivalent hardware.)
* The native Win32 port has been overhauled (previous code was never released
  to the public) to make sure I did not have to look for OS-specific path
  elements _everywhere_ in the code (it was becoming a code-wise maintenance
  nightmare while fixing up/down all those 'absolute path' and 'path
  expansion' code sections to handle Win32 drive letters (root is
  '[A-Z]:[\\/]' instead of simply '/')). This has been fixed by using the
  cygwin 'path hack' for the native Win32 port too: root is
  '/cygdrive/[a-z]/' so it looks exactly like a UNIX path. Any places in the
  code which need to address the OS while passing an OS-specific path are now
  handled almost invisibly: all relevant C RTL calls
  (fopen/open/stat/lstat/symlink/link/unlink/rename/mkdir/rmdir/opendir) are
  now encapsulated in tl_[sysname] wrapper functions where these
  /cygdrive/[x]/ paths are converted back to native Win32 paths before the
  actual C RTL function is called. Also, any debug/print statement which is
  used to report a file path is fixed to convert file paths to the native
  representation with a minimum of fuss: see the new tl_native() call for a
  description of how this was done. This code has not been tested in a
  UNIX/MP environment, but the design is such that this should not cause any
  trouble (pthread port for Win32 is in progress ATM).
* added -debug_level modes: all/trace/dev/bufio/cookie/htmlform. Also added a
  feature where you can now specify a set of debug levels and have some of
  those levels _removed_, e.g. 'all,!dev' will show anything _except_ 'dev'
  level debug output: note the new '!' prefix.
* -debug_level output is now prefixed with its level in caps and square
  brackets, e.g. '[PROCE]', to aid in filtering the debug output (for
  instance by piping it through sed/grep).
* unified debug output handling in the code: -debug_levels are now only
  active when you specify -debug too.
* inflate_decode() and gzip_decode() have been fixed to suit a multithreaded
  environment.
  gzip_decode() now has an in-memory implementation, using the zlib library,
  for those systems which do not support UNIX pipes/forks.
* Fixed deflate/compress handling: the MJF Accept-Encoding deflate hack has
  been removed and the request header extended. (Tested on a Wikipedia
  HTTP/1.1 compliant server.) You may wish to permanently disable the code
  within
  #if defined(HAVE_WAITPID) && defined(HAVE_FORK) && defined(HAVE_PIPE)
  in decode.c if you do not wish to depend on the external gzip tool any
  more.
* _all_ system header file #include's have been removed from the sources and
  integrated into config.h to allow for more portable source code.
  config.h.in and autoconf.am have been extended to include several more
  OS-dependent system call and header file checks. A separate native Win32
  version of the header file is also provided (used by the MSVC2005 native
  Win32 build).
* several hardcoded buffer sizes in the software have been made configurable
  through named constants (but remain hardcoded at compile time). See for
  instance dinfo.c:
    12 --> PAVUK_INFO_DIRNAME
  and
    1024-and-other-fixed-buf-sizes --> BUFIO_ADVISED_READLN_BUFSIZE
* fixed several cases where dangling (i.e. free()d but not NULL-ed) pointers
  caused havoc. Code has been quickly reviewed to locate and fix additional
  spots that did not yet cause pavuk to go 'crazy Ivan' (Hunt for the Red
  October, anyone? ;-) )
* hardcoded lock filenames have been converted to #define's to allow these to
  be changed in a single spot (config.h), improving portability. e.g.:
    '._lock' --> PAVUK_LOCK_FILENAME
* UNIX-specific octal privs have been changed to their proper #define's to
  allow for maximum portability (Win32 doesn't know '0644' but can cope with
  S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH, though maybe in an odd way).
* fixed quite a few spots where an unidentified form encoding method would
  lead to _very_ unstable behaviour, including crashes/core dumps. Look for
  fi->method = FORM_M_UNKNOWN assignments and additional FORM_M_UNKNOWN
  checks.
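
The dangling pointer fix mentioned above (free()d but not NULL-ed) boils down to a pattern like the one below. FREE_AND_NULL, record and record_clear are hypothetical names chosen for this sketch; they are not taken from the pavuk sources.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical macro name. Frees the block and clears the pointer so a
 * later use fails loudly on NULL instead of touching freed memory. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Hypothetical structure owning one heap-allocated string. */
typedef struct {
    char *name;
} record;

void record_clear(record *r)
{
    assert(r != NULL);
    /* Safe to call repeatedly: after the first call r->name is NULL,
     * and free(NULL) is defined to do nothing. */
    FREE_AND_NULL(r->name);
}
```

The do { ... } while (0) wrapper makes the macro behave like a single statement, so it can be used in an unbraced if/else without surprises.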
* added -no_dns support for those who have to work in an environment with
  flaky or no DNS support. (I had to, as I was working on a box in a
  specially configured, partially walled-off DMZ zone while developing and
  testing pavuk against a web server.)
* fixed typos in the text as I came across them.
* the bufio overhaul also led to an overhaul of the -dumpxxx code,
  removing/fixing several spots in the code which caused incorrect/unstable
  behaviour (e.g. code in doc.c).
* Fixed handling of compressed data for any text-based server response; pavuk
  now correctly handles any gzipped/deflated text, including, for instance,
  any 'text/javascript' content sent over the wire in compressed form (tested
  on a Wikipedia-based HTTP/1.1 compliant server).
* added -progress_mode: several choices in progress verbosity.
* added -no_disc_io: test a grab/scan without writing anything to disc.
  Mostly useful in combination with the earlier -hammer modes.
* fixed/updated HTTP error response handling in accordance with RFC2616 so I
  can better see what an HTTP/1.1 compliant target is reporting back to pavuk
  (errcode.c et al).
* unified timing units to fix a few timing oddities: instead of minutes, etc.
  the code uses seconds everywhere (apart, of course, from the few locations
  where we use milliseconds ;-) ). -timeout is now in milliseconds!
* Added -rtimeout and -wtimeout command line parameters (unit: milliseconds).
* added -allow_persistent / -noallow_persistent commandline arguments to
  allow/disallow the use of HTTP/1.1 persistent connections.
* added -dumpcmd and -dumpdir commandline arguments.
* added -bad_content commandline argument for use with the hammer/chunky
  modes.
* added -report_url_on_err commandline argument: report the URL which was
  being processed when the error occurred.
* added -test_id commandline argument: this is included in the timing report
  so reports can be more easily processed / combined automatically.
* added -page_sfx commandline argument to help pavuk identify which suffixes
  are to be considered web pages (useful for scanning ASP and ASP.NET sites
  which present unusual mime types with their pages).
* added -tlogfile4sum commandline argument: specify a log file where timing
  info is stored. Handy when pavuk is not only used to grab the info off a
  site but also to scan & report site performance.
* added -encode commandline parameter as the counterpart of -noencode.
* added -nohtDig, -noquiet and -noverbose commandline parameters as
  counterparts of -htDig, -quiet and -verbose respectively.
* added filepath support to -dumpfd and -dump_urlfd: by specifying the option
  value prefixed with a '@' character, pavuk will treat the option value as a
  filepath specification instead of an OS file handle and subsequently open
  the specified file internally. Note that adding yet another '@' character
  as a prefix signals pavuk to _append_ to the specified file, instead of
  _overwriting_ it. This is useful when you wish to have those dumps but are
  working in an environment where you cannot pass valid file handles through
  the commandline.
* added -dump_request and -nodump_request commandline arguments for use with
  -dumpfd: when -dump_request is specified, the log file will include a
  complete dump of each request sent to the server by pavuk. Thus you can
  produce a complete audit trail of the exchange.
* replaced the DUMP_URLLIST macros in stats.c by two functions. Code is a bit
  cleaner that way.
* fixed times.c which barfed on timestamps beyond 2037 (signed int wrap
  around for time_t).
* added assert() checks at several locations in the code to help track down
  unexpected behaviour which could lead to crashes (like it did till now).
* unified the proliferation of HEX2ASC-alike macros with and without
  off-by-one offsets inside. Now there's one macro for each of 'em in
  tools.h.
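
The '@' prefix convention described above for -dumpfd and -dump_urlfd (plain value: OS file handle; '@path': open and overwrite; '@@path': open and append) could be parsed along these lines. This is a sketch under assumptions: dump_target, parse_dump_target and dump_open_mode are hypothetical names, and treating the handle as a non-negative decimal number is an assumption, not something the entry above states.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types and names; not pavuk's actual option parser. */
typedef enum { DUMP_FD, DUMP_FILE_OVERWRITE, DUMP_FILE_APPEND } dump_kind;

typedef struct {
    dump_kind   kind;
    int         fd;    /* valid when kind == DUMP_FD */
    const char *path;  /* valid otherwise; points into the argument */
} dump_target;

/* Returns 0 on success, -1 when the value is neither a usable handle
 * nor a non-empty path. */
int parse_dump_target(const char *arg, dump_target *out)
{
    assert(arg != NULL && out != NULL);
    out->fd = -1;
    out->path = NULL;

    if (arg[0] == '@') {
        if (arg[1] == '@') {            /* '@@path': append */
            out->kind = DUMP_FILE_APPEND;
            out->path = arg + 2;
        } else {                        /* '@path': overwrite (default) */
            out->kind = DUMP_FILE_OVERWRITE;
            out->path = arg + 1;
        }
        return out->path[0] != '\0' ? 0 : -1;
    } else {                            /* no prefix: OS file handle */
        char *end;
        long v = strtol(arg, &end, 10);
        if (end == arg || *end != '\0' || v < 0 || v > INT_MAX)
            return -1;
        out->kind = DUMP_FD;
        out->fd = (int)v;
        return 0;
    }
}

/* fopen() mode string matching the parsed kind (file kinds only). */
const char *dump_open_mode(dump_kind kind)
{
    return kind == DUMP_FILE_APPEND ? "ab" : "wb";
}
```

Under this sketch, a value such as '@@run.log' yields DUMP_FILE_APPEND with path 'run.log', which the caller would then open with fopen(path, dump_open_mode(kind)).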
* changed the configure.in option to --disable-threads to keep the pattern
  consistent (--disable-xxx series of options in configure), but the default
  behaviour remains the same.
* configure.in: as --disable-debug removes any debug-_related_ features from
  the pavuk build, these options have been added:
    --disable-debugging will create a default build with all debugging
    removed from the compiled binaries.
    --disable-prof and --disable-gprof have been added to remove any profile
    info from the default compiled binaries.
* added checks in configure.in for socklen_t, pid_t and a bunch of system
  calls and header files that do not live in each environment.

2007-may-6

* included pthreads-Win32 based multithreading support in the native Win32
  build.
* included EXPERIMENTAL tre (regex) support in the native Win32 build.
* fixed several lurking bugs (buffer overruns, etc.) which only showed in a
  multithreaded environment.
* fixed locking bugs in the new bufio implementation.
* added Win32 memory leak + heap checking for the DEBUG build: many memory
  leaks have been tracked and fixed. (MSVC based)
* fixed memory leak due to wrong scope in report_error() code.
* added DBGxxx macros to aid heap tracking for the debug build. See
  DBGdecl/DBGpass/DBGvars usage.
* removed a very nasty memleak in html_parser_get_url() which would leak at
  least 3 blocks for each rejected local anchor URL, and there are quite a
  few of those! Took me a day to track it down. :-(
* added filtering so gzipped/compressed files on the server are not
  decompressed unintentionally while the server supports
  Accept-Encoding: gzip or compress. ( doc_download_helper() in doc.c )

Notes:

* I'm a bit unsure about that DEFAULTRC and GETTEXT_DEFAULT_CATALOG_DIR
  stuff, as it is still around in the code and seems to 'make sense' in an
  odd sort of way, so I've kept those items in there in the ac-config.h.in
  and configure.in (they disappeared going from the official 0.9.34 to 0.9.35
  releases).
* pid_t handling is a bit convoluted for the native build (see config.h), but
  that's just 5 lines of code in, well, ...
* given that I don't have access to UNIX boxes in 2007, I might have broken
  the UNIX build (configure.in et al) on the way out. Most of the changes in
  there are from before, when I was able to test them on a few boxes, but
  YMMV since I recently merged my code with the latest CVS (april/may 2007).