Accumulators summarize data inserted into them. They are useful for
quickly computing a “bird’s-eye” view of a given data source. For
each piece of a Pads description, the accumulator summarizes the
percentage of errors seen and reports the most frequently seen values.
For example,
when run on sample web server log data,
the accumulator report for the length field contains the information
shown in Figure 16.1.
<top>.length : uint32
+++++++++++++++++++++++++++++++++++++++++++
good: 53544 bad: 3824 pcnt-bad: 6.666
min: 35 max: 248591 avg: 4090.234
top 10 values out of 1000 distinct values:
tracked 99.552% of values
val: 3082 count: 1254 %-of-good: 2.342
val: 170 count: 1148 %-of-good: 2.144
val: 43 count: 1018 %-of-good: 1.901
val: 9372 count: 975 %-of-good: 1.821
val: 1425 count: 896 %-of-good: 1.673
val: 518 count: 893 %-of-good: 1.668
val: 1082 count: 881 %-of-good: 1.645
val: 1367 count: 874 %-of-good: 1.632
val: 1027 count: 859 %-of-good: 1.604
val: 1277 count: 857 %-of-good: 1.601
. . . . . . . . . . . . . . . . . . . . . .
SUMMING count: 9655 %-of-good: 18.032
Figure 16.1: Portion of accumulator report for length field of web server
log data.
By default, accumulators track the first 1000 distinct
values seen in the data source and report the frequency
of the top ten values. In this particular run, 99.552%
of all values were tracked.
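The tracking rule above can be illustrated with a small, self-contained sketch. This is not the Pads implementation: the names toy_acc, toy_acc_init, toy_acc_add, toy_acc_pcnt_tracked, and the limit TOY_MAX2TRACK are hypothetical, and the limit is set to 4 rather than 1000 so the effect is easy to see. The sketch only shows how a "tracked" percentage below 100 can arise once the number of distinct values exceeds the limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX2TRACK 4  /* hypothetical small limit; Pads defaults to 1000 */

/* Toy accumulator (an assumption for illustration, not a Pads type). */
typedef struct {
    uint32_t vals[TOY_MAX2TRACK];    /* distinct values being tracked */
    uint64_t counts[TOY_MAX2TRACK];  /* occurrences of each tracked value */
    size_t   ndistinct;              /* number of slots in use */
    uint64_t good;                   /* all good values seen so far */
    uint64_t tracked;                /* good values that fell into a slot */
} toy_acc;

static void toy_acc_init(toy_acc *a) {
    assert(a != NULL);
    a->ndistinct = 0;
    a->good = 0;
    a->tracked = 0;
}

static void toy_acc_add(toy_acc *a, uint32_t v) {
    a->good++;
    /* A value that is already tracked just bumps its count. */
    for (size_t i = 0; i < a->ndistinct; i++) {
        if (a->vals[i] == v) {
            a->counts[i]++;
            a->tracked++;
            return;
        }
    }
    /* A new value is tracked only while there is room for it. */
    if (a->ndistinct < TOY_MAX2TRACK) {
        a->vals[a->ndistinct] = v;
        a->counts[a->ndistinct] = 1;
        a->ndistinct++;
        a->tracked++;
    }
    /* Otherwise the value counts as good but is not tracked. */
}

/* Percentage of good values that were tracked. */
static double toy_acc_pcnt_tracked(const toy_acc *a) {
    return a->good ? 100.0 * (double)a->tracked / (double)a->good : 0.0;
}
```

Feeding the values 1, 2, 3, 4, 5, 1, 6, 2, 1, 7 into this toy tracks only 1 through 4, so seven of the ten values are tracked and the reported figure would be 70%.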
16.1 Operations
Figure 16.2 shows the accumulator type
declaration and associated functions for a Pads type.
typedef struct {
  Puint32_acc nerr;
  order_header_t_acc h;
  eventSeq_t_acc events;
} entry_t_acc;

Perror_t entry_t_acc_init (P_t *pads,entry_t_acc *acc);
Perror_t entry_t_acc_reset (P_t *pads,entry_t_acc *acc);
Perror_t entry_t_acc_cleanup (P_t *pads,entry_t_acc *acc);
Perror_t entry_t_acc_add (P_t *pads,entry_t_acc *acc,
                          entry_t_pd *pd,entry_t *rep);
Perror_t entry_t_acc_report2io (P_t *pads,Sfio_t *outstr,char const *prefix,
                                char const *what,int nst,entry_t_acc *acc);
Perror_t entry_t_acc_report (P_t *pads,char const *prefix,
                             char const *what,
                             int nst,entry_t_acc *acc);
Figure 16.2: Accumulator functions generated for the entry_t type.
These functions have the following behaviors:
- entry_t_acc_init
- Initializes accumulator data
structure. This function must be called before any data can be added
to the accumulator.
- entry_t_acc_reset
- Reinitializes accumulator data
structure, erasing all information previously stored.
- entry_t_acc_cleanup
- Deallocates all memory associated
with accumulator.
- entry_t_acc_add
- Inserts argument in-memory
representation and parse descriptor into argument
accumulator. The parse descriptor allows the accumulator to track
errors as well as legal values.
- entry_t_acc_report2io
- Writes summary report for
accumulator acc to open
SFIO stream outstr. The argument prefix is a descriptive
string, usually the path to the data being accumulated. If
NULL, the string "<top>" is used. In
the accumulator snippet in Figure 16.1, this
path is <top>.length. The argument what is a string
describing the kind of data. If NULL, a short form of the
accumulator type name is used as a default, e.g. uint32 for
Puint32. The argument nst indicates the nesting
level. Level zero should be used for a top-level call. Reporting
routines increment the nesting level for recursive report calls that
describe sub-parts. Nesting level -1 indicates a
minimal prefix header should be output, i.e., just the prefix
without any adornment.
- entry_t_acc_report
- Writes summary report for
accumulator acc to standard error. The other arguments are the
same as for entry_t_acc_report2io.
Figure 16.3 illustrates a sample use of accumulator
functions for printing a summary of entry_t records in CLF data.
#include "wsl.h"
#define DEF_INPUT_FILE "data/wsl"

int main(int argc, char** argv) {
  P_t         *pads;
  Pio_disc_t  *io_disc;
  entry_t      rep;    /* in-memory representation */
  entry_t_pd   pd;     /* parse descriptor */
  entry_t_m    mask;
  entry_t_acc  acc;    /* accumulator */
  char        *fname = DEF_INPUT_FILE;

  /* Open a Pads handle and initialize the per-type structures. */
  io_disc = P_nlrec_noseek_make(0);
  P_open(&pads, 0, io_disc);
  entry_t_init(pads, &rep);
  entry_t_pd_init(pads, &pd);
  entry_t_m_init(pads, &mask, P_CheckAndSet);

  if (P_ERR == P_io_fopen(pads, fname)) {
    error(2, "*** P_io_fopen failed ***");
    return -1;
  }

  /* The accumulator must be initialized before any data is added. */
  entry_t_acc_init(pads, &acc);

  /* Read each record and add it, with its parse descriptor, to acc. */
  while (!P_io_at_eof(pads)) {
    entry_t_read(pads, &mask, &pd, &rep);
    entry_t_acc_add(pads, &acc, &pd, &rep);
  }

  /* Write the summary report to standard error. */
  entry_t_acc_report(pads, "", 0, 0, &acc);

  /* Release all resources. */
  P_io_close(pads);
  entry_t_cleanup(pads, &rep);
  entry_t_pd_cleanup(pads, &pd);
  entry_t_acc_cleanup(pads, &acc);
  P_close(pads);
  return 0;
}
Figure 16.3: Simple use of accumulator functions for the
entry_t type from CLF data.
16.2 Customization
The Pads discipline allows users to customize various aspects of
accumulation by setting the appropriate field in the discipline. If
pads is an active Pads handle, then pads->disc provides
access to the discipline, which contains the following
accumulator-related fields:
- acc_max2track
- is a Puint64 denoting the default maximum number of distinct values
for accumulators to track. Setting this field to P_MAX_UINT64
indicates no limit. Note that the higher the value, the more memory
accumulators will consume. By default, the Pads system sets
this value to 1000. When an acc_init function is
called on a base-type accumulator a, the field
a.max2track is set to pads->disc->acc_max2track.
The value a.max2track may be modified by hand after this call
to force the accumulator a to use a non-default value.
- acc_max2rep
- is a Puint64 denoting the default number of tracked values for
accumulators to describe in detail in the generated report. Setting
this field to P_MAX_UINT64 indicates no limit on the tracked values
to display. By default, the Pads system sets this value to
ten. When an acc_init function is called on a
base-type accumulator a, a.max2rep is set to
pads->disc->acc_max2rep. The value a.max2rep can be
modified by hand after this call to force the accumulator a to
use a non-default value.
- acc_pcnt2rep
- is a Pfloat denoting the default percent of values for
accumulators to describe in detail in the generated report. Setting this field to 100.0
indicates no limit on the set of tracked values to display. By
default, Pads sets this value to 100.0.
Upon calling an acc_init function on some base-type accumulator a,
a.pcnt2rep is set to pads->disc->acc_pcnt2rep.
a.pcnt2rep can be modified by hand after this call to force
the accumulator a to use a non-default value.
Note that both acc_max2rep and acc_pcnt2rep set a limit on
the number of tracked values to display. Reporting stops as soon as
either limit is reached.
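The interaction of the two reporting limits can be sketched as follows. The function toy_entries_to_report is hypothetical and is not part of Pads. It assumes that the percentage limit is measured as the cumulative share of good values covered by the entries already listed, which is one plausible reading of the text above rather than a documented rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Pads function). Given the counts of the
 * tracked values sorted in descending order, return how many entries a
 * report would list under a count limit (max2rep) and a cumulative
 * percentage limit (pcnt2rep). Listing stops when either is reached.
 */
static size_t toy_entries_to_report(const uint64_t *counts, size_t n,
                                    uint64_t total_good,
                                    uint64_t max2rep, double pcnt2rep) {
    size_t shown = 0;
    uint64_t running = 0;  /* good values covered by the entries listed */

    assert(counts != NULL || n == 0);
    while (shown < n) {
        /* Count limit: max2rep entries have already been listed. */
        if (shown >= max2rep) break;
        /* Percentage limit: the listed entries already cover pcnt2rep. */
        if (total_good > 0 &&
            100.0 * (double)running / (double)total_good >= pcnt2rep) break;
        running += counts[shown];
        shown++;
    }
    return shown;
}
```

With counts 50, 30, 15, 5 out of 100 good values, the limits (10, 100.0) list all four entries, (2, 100.0) list two, and (10, 75.0) also list two, because the first two entries already cover 80% of the good values.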
Generated accumulators have components that are base-type
accumulators. Thus, after initializing some generated accumulator
a, one could modify a.foo.bar.max2track or
a.foo.bar.max2rep to change the tracking or reporting of the
foo.bar component of a.
16.3 Template Program
Because generating an accumulator report from a Pads description is
a very routine task, Pads provides a template program to automate
the task for common data formats. In particular, the template applies
to data that can be viewed as an optional header followed by a
sequence of records. Note that any data source that can be read
entirely into memory fits this pattern by considering the source to
have no header and a single body record.
When instantiated, the template program takes an optional command-line
argument specifying the path to the data source. If no argument is
given, it uses a default location for the data specified by the
template user.
The template first reads the optional header, then
reads each record and inserts the resulting value into an
accumulator until the data source is exhausted, at which point it
prints the accumulator report to standard error.
The code in Figure 2.7
illustrates using the accumulator template
template/accum_report.h. This template is a C header file
parameterized by a number of macros that permit the user to customize
the template by defining appropriate values for these macros. For
example, in the code in Figure 2.7, the user defines the
macros DEF_INPUT_FILE, PADS_TY, and IO_DISC_MK to
indicate the default input file, the type of the repeated record in
the data source, and the IO discipline.
The following list describes these and the other macros used by the
accumulator template:
- DATE_IN_FMT
- If defined, this macro sets the default
input format for dates described by Pdate. See
Section 15.1.12 for more
information.
- DATE_OUT_FMT
- If defined, this macro sets the default
output format for Pdate and Pdate_explicit. See
Section 15.1.13 for more information.
- DEF_INPUT_FILE
- If defined, this macro specifies a
string representation of the path to the default data source. If no
path to the data is supplied on the command line, this is the
location used for input data.
- EXTRA_BAD_READ_CODE
- If defined, this macro expands to a C
statement that will be executed after any body record containing an
error.
- EXTRA_BEGIN_CODE
- If defined, this macro expands to a C
statement that will be executed after all initialization code is
performed, but before the optional header is read.
- EXTRA_DECLS
- This optional macro defines additional C
declarations that precede all accumulator code.
- EXTRA_DONE_CODE
- If defined, this macro expands to a C
statement that will be executed after generating the accumulator report.
- EXTRA_GOOD_READ_CODE
- If defined, this macro expands to a C
statement that will be executed after any body record not containing an
error.
- EXTRA_HEADER_READ_ARGS
- If the type of the header record
was parameterized, this macro allows the user to supply
corresponding parameters.
- EXTRA_READ_ARGS
- If the type of the repeated record was
parameterized, this macro allows the user to supply corresponding
parameters.
- IN_TIME_ZONE
- If set, this macro specifies the input time
zone of date types that do not include time zone information.
See Section 15.1.10 for more detail.
- IO_DISC_MK
- If defined, this macro specifies the
interpretation of Precord by indicating which IO discipline the
system should install. It specifies the discipline by naming the
function to create the discipline. Section 15.2
describes the available IO discipline creation functions. If the
user does not define this macro, the system installs the IO
discipline corresponding to new-line terminated ASCII records.
- MAX_RECS
- If defined, this macro specifies an integer that
limits the number of repeated records that the accumulator program
should read.
- OUT_TIME_ZONE
- If set, this macro specifies the output
time zone of date types.
See Section 15.1.11 for more detail.
- PADS_HDR_TY
- Intuitively, this macro defines the
type of the header record in the data source. This macro need only
be defined if the data source has a header record.
It defines a function used by the template
program to generate the various function and type names derived from
the name of the header record type, i.e., the type of the associated
in-memory representation, mask, parse descriptor, read function,
etc.
- PADS_TY
- Intuitively, this macro defines the
type of the repeated record in the data source, i.e., the type of
the value to be accumulated. This macro must be defined to use the
accumulator template. It defines a function used by the template
program to generate the various function and type names derived from
the name of the record type, i.e., the type of the associated
in-memory representation, mask, parse descriptor, read function,
etc.
- READ_MASK
- This macro specifies the mask to use in reading
the repeated record. If not defined by the user, the template uses
the value P_CheckAndSet.
- TIME_IN_FMT
- If defined, this macro sets the default
input format for Ptime. See
Section 15.1.12 for more
information.
- TIME_OUT_FMT
- If defined, this macro sets the default
output format for Ptime and Ptime_explicit. See
Section 15.1.13 for more information.
- TIMESTAMP_IN_FMT
- If defined, this macro sets the default
input format for Ptimestamp. See
Section 15.1.12 for more
information.
- TIMESTAMP_OUT_FMT
- If defined, this macro sets the default
output format for the Pads types Ptimestamp and Ptimestamp_explicit. See
Section 15.1.13 for more information.
- WSPACE_OK
- If defined, this macro indicates that leading
white space for variable-width ASCII integers is okay, as well as
leading and trailing white space for fixed-width ASCII integers.
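
The descriptions of PADS_TY and PADS_HDR_TY say that each macro "defines a function" from which the template derives type and function names. One common way to do this in C is a function-like macro that pastes a suffix onto the type name with the ## operator. The sketch below illustrates only that technique. The record type rec_t, its accumulator, and toy_template_run are hypothetical stand-ins with simplified signatures, not names or signatures generated by Pads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for names generated for a record type rec_t. */
typedef struct { int length; } rec_t;
typedef struct { long nadded; long total; } rec_t_acc;

static void rec_t_acc_init(rec_t_acc *acc) {
    acc->nadded = 0;
    acc->total = 0;
}

static void rec_t_acc_add(rec_t_acc *acc, const rec_t *rep) {
    assert(acc != NULL && rep != NULL);
    acc->nadded++;
    acc->total += rep->length;
}

/* User side: map a suffix to the name derived from the record type. */
#define PADS_TY(suf) rec_t ## suf

/*
 * Template side: written once in terms of PADS_TY, so the same code
 * works for whichever record type the user names in the macro.
 */
static long toy_template_run(const int *lengths, int n) {
    PADS_TY(_acc) acc;  /* expands to rec_t_acc */
    PADS_TY() rep;      /* an empty suffix yields rec_t itself */

    PADS_TY(_acc_init)(&acc);         /* rec_t_acc_init(&acc) */
    for (int i = 0; i < n; i++) {
        rep.length = lengths[i];
        PADS_TY(_acc_add)(&acc, &rep);  /* rec_t_acc_add(&acc, &rep) */
    }
    return acc.total;
}
```

Changing the single #define line to name a different record type retargets every derived name in the template body, which is why the template needs only the one macro to find the representation, accumulator, and associated functions.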