
Chapter 16  Accumulators

Accumulators summarize data inserted into them. They are useful for quickly computing a “bird’s-eye” view of a given data source. For each piece of a Pads description, the accumulator summarizes the percentage of errors seen and reports the most frequently seen values. For example, when run on sample web server log data, the accumulator report for the length field contains the information shown in Figure 16.1.


<top>.length : uint32
+++++++++++++++++++++++++++++++++++++++++++
good: 53544   bad: 3824    pcnt-bad: 6.666
min: 35  max: 248591  avg: 4090.234
top 10 values out of 1000 distinct values:
tracked 99.552% of values
 val:  3082 count:  1254  %-of-good:  2.342
 val:   170 count:  1148  %-of-good:  2.144
 val:    43 count:  1018  %-of-good:  1.901
 val:  9372 count:   975  %-of-good:  1.821
 val:  1425 count:   896  %-of-good:  1.673
 val:   518 count:   893  %-of-good:  1.668
 val:  1082 count:   881  %-of-good:  1.645
 val:  1367 count:   874  %-of-good:  1.632
 val:  1027 count:   859  %-of-good:  1.604
 val:  1277 count:   857  %-of-good:  1.601
. . . . . . . . . . . . . . . . . . . . . . 
 SUMMING    count:  9655  %-of-good: 18.032
Figure 16.1: Portion of accumulator report for length field of web server log data.

By default, accumulators track the first 1000 distinct values seen in the data source and report the frequency of the top ten values. In this particular run, 99.552% of all values were tracked.
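The tracking behavior can be sketched with a toy accumulator written in plain C. This is only an illustration under stated assumptions, not the Pads implementation: the names toy_acc, toy_acc_init, toy_acc_add, toy_acc_pcnt_bad, and toy_acc_pcnt_tracked are hypothetical, the tracking limit is shrunk to 4 so the effect is easy to see, an is_error flag stands in for the parse descriptor, and the "tracked" percentage is assumed to be relative to the good values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX2TRACK 4   /* tiny limit for illustration; Pads defaults to 1000 */

/* Hypothetical toy accumulator; this is NOT the layout of Puint32_acc. */
typedef struct {
    uint64_t good;                    /* values inserted without error */
    uint64_t bad;                     /* values flagged as erroneous */
    uint64_t tracked;                 /* good values that landed in a tracked slot */
    size_t   ndistinct;               /* number of slots in use */
    uint32_t val[TOY_MAX2TRACK];      /* the first TOY_MAX2TRACK distinct values */
    uint64_t count[TOY_MAX2TRACK];    /* occurrences of each tracked value */
} toy_acc;

void toy_acc_init(toy_acc *acc) {
    assert(acc != NULL);
    acc->good = acc->bad = acc->tracked = 0;
    acc->ndistinct = 0;
}

/* Insert one value; is_error plays the role of the parse descriptor. */
void toy_acc_add(toy_acc *acc, uint32_t v, int is_error) {
    size_t i;
    assert(acc != NULL);
    if (is_error) { acc->bad++; return; }
    acc->good++;
    for (i = 0; i < acc->ndistinct; i++) {
        if (acc->val[i] == v) { acc->count[i]++; acc->tracked++; return; }
    }
    if (acc->ndistinct < TOY_MAX2TRACK) {
        /* A new distinct value and a free slot: start tracking it. */
        acc->val[acc->ndistinct] = v;
        acc->count[acc->ndistinct] = 1;
        acc->ndistinct++;
        acc->tracked++;
    }
    /* Otherwise the value still counts as good but is not tracked. */
}

/* Percentage of all inserted values that were erroneous. */
double toy_acc_pcnt_bad(const toy_acc *acc) {
    uint64_t total = acc->good + acc->bad;
    return total ? 100.0 * (double)acc->bad / (double)total : 0.0;
}

/* Percentage of good values that fell into a tracked slot. */
double toy_acc_pcnt_tracked(const toy_acc *acc) {
    return acc->good ? 100.0 * (double)acc->tracked / (double)acc->good : 0.0;
}
```

The same arithmetic reproduces the error rate in Figure 16.1: 3824 bad values out of 53544 + 3824 = 57368 total gives 6.666% bad.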

16.1  Operations

Figure 16.2 shows the accumulator type declaration and associated functions for a Pads type.


typedef struct  {
  Puint32_acc nerr;
  order_header_t_acc h;
  eventSeq_t_acc events;
} entry_t_acc;

Perror_t entry_t_acc_init (P_t *pads,entry_t_acc *acc);
Perror_t entry_t_acc_reset (P_t *pads,entry_t_acc *acc);
Perror_t entry_t_acc_cleanup (P_t *pads,entry_t_acc *acc);
Perror_t entry_t_acc_add (P_t *pads,entry_t_acc *acc,
                          entry_t_pd *pd,entry_t *rep);
Perror_t entry_t_acc_report2io (P_t *pads,Sfio_t *outstr,
                                char const *prefix,
                                char const *what,int nst,entry_t_acc *acc);
Perror_t entry_t_acc_report (P_t *pads,
                             char const *prefix,
                             char const *what,
                             int nst,entry_t_acc *acc);
Figure 16.2: Accumulator functions generated for the entry_t type.

These functions have the following behaviors:

entry_t_acc_init
Initializes accumulator data structure. This function must be called before any data can be added to the accumulator.
entry_t_acc_reset
Reinitializes accumulator data structure, erasing all information previously stored.
entry_t_acc_cleanup
Deallocates all memory associated with accumulator.
entry_t_acc_add
Inserts argument in-memory representation and parse descriptor into argument accumulator. The parse descriptor allows the accumulator to track errors as well as legal values.
entry_t_acc_report2io
Writes summary report for accumulator acc to open SFIO stream outstr. The argument prefix is a descriptive string, usually the path to the data being accumulated. If NULL, the string "<top>" is used. In the accumulator snippet in Figure 16.1, this path is <top>.length. The argument what is a string describing the kind of data. If NULL, a short form of the accumulator type name is used as a default, e.g., uint32 for Puint32. The argument nst indicates the nesting level. Level zero should be used for a top-level call. Reporting routines bump the nesting level for recursive report calls that describe sub-parts. Nesting level -1 indicates that a minimal prefix header should be output, i.e., just the prefix without any adornment.
entry_t_acc_report
Writes summary report for accumulator acc to standard error. The other arguments are the same as for entry_t_acc_report2io.

Figure 16.3 illustrates a sample use of accumulator functions for printing a summary of CLF entry_ts.


#include "wsl.h"
#define DEF_INPUT_FILE  "data/wsl"

int main(int argc, char** argv) {
  P_t                  *pads;
  Pio_disc_t           *io_disc;
  entry_t              rep;
  entry_t_pd           pd;
  entry_t_m            mask;
  entry_t_acc          acc;
  char                 *fname = DEF_INPUT_FILE;

  /* Create the IO discipline and open a Pads handle that uses it. */
  io_disc = P_nlrec_noseek_make(0);
  P_open(&pads, 0, io_disc);

  /* Initialize the in-memory representation, parse descriptor, and mask. */
  entry_t_init(pads, &rep);
  entry_t_pd_init(pads, &pd);
  entry_t_m_init(pads, &mask, P_CheckAndSet);

  if (P_ERR == P_io_fopen(pads, fname)) {
    error(2, "*** P_io_fopen failed ***");
    return -1;
  }

  /* The accumulator must be initialized before any data is added. */
  entry_t_acc_init(pads, &acc);
  while (!P_io_at_eof(pads)) {
    entry_t_read(pads, &mask, &pd, &rep);
    entry_t_acc_add(pads, &acc, &pd, &rep);
  }
  /* Arguments: prefix "", what 0 (default description), nesting level 0. */
  entry_t_acc_report(pads, "", 0, 0, &acc);

  P_io_close(pads);
  entry_t_cleanup(pads, &rep);
  entry_t_pd_cleanup(pads, &pd);
  entry_t_acc_cleanup(pads, &acc);
  P_close(pads);
  return 0;
}
Figure 16.3: Simple use of accumulator functions for the entry_t type from CLF data.

16.2  Customization

The Pads discipline allows users to customize various aspects of accumulation by setting the appropriate field in the discipline. If pads is an active Pads handle, then pads->disc provides access to the discipline, which contains the following accumulator-related fields:

acc_max2track
is a Puint64 denoting the default maximum number of distinct values for accumulators to track. Setting this field to P_MAX_UINT64 indicates no limit. Note that the higher the value, the more memory accumulators will consume. By default, the Pads system sets this value to 1000. When an acc_init function is called on a base-type accumulator a, the field a.max2track is set to pads->disc->acc_max2track. The value a.max2track may be modified by hand after this call to force the accumulator a to use a non-default value.
acc_max2rep
is a Puint64 denoting the default number of tracked values for accumulators to describe in detail in the generated report. Setting this field to P_MAX_UINT64 indicates no limit on the tracked values to display. By default, the Pads system sets this value to ten. When an acc_init function is called on a base-type accumulator a, a.max2rep is set to pads->disc->acc_max2rep. The value a.max2rep can be modified by hand after this call to force the accumulator a to use a non-default value.
acc_pcnt2rep
is a Pfloat denoting the default percent of values for accumulators to describe in detail in the generated report. Setting this field to 100.0 indicates no limit on the set of tracked values to display. By default, the Pads system sets this value to 100.0. When an acc_init function is called on a base-type accumulator a, a.pcnt2rep is set to pads->disc->acc_pcnt2rep. The value a.pcnt2rep can be modified by hand after this call to force the accumulator a to use a non-default value.

Note that both acc_max2rep and acc_pcnt2rep set a limit on the number of tracked values to display. Reporting stops as soon as either limit is reached.
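How the two limits might interact can be sketched as follows. This is a hypothetical helper, not Pads code: the function name toy_rows_to_report is invented for illustration, and it assumes that the percentage limit is measured as the cumulative share of good values covered by the rows already listed, which the text above does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of rows a report would list.
 *   counts   per-value counts, most frequent first
 *   n        number of tracked values in counts
 *   good     total number of good values seen
 *   max2rep  row limit (cf. acc_max2rep)
 *   pcnt2rep percentage limit (cf. acc_pcnt2rep); ASSUMED to mean
 *            "stop once the listed rows cover this share of good values"
 */
size_t toy_rows_to_report(const uint64_t *counts, size_t n, uint64_t good,
                          uint64_t max2rep, double pcnt2rep) {
    uint64_t covered = 0;   /* good values accounted for by listed rows */
    size_t rows = 0;
    assert(counts != NULL || n == 0);
    while (rows < n && rows < max2rep) {
        /* Stop as soon as the percentage limit has been reached. */
        if (good > 0 && 100.0 * (double)covered / (double)good >= pcnt2rep)
            break;
        covered += counts[rows];
        rows++;
    }
    return rows;
}
```

Under these assumptions, the defaults (ten rows, 100.0 percent) make the row limit the one that normally applies, as in Figure 16.1, where the ten listed values cover only 18.032% of the good values.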

Generated accumulators have components that are base-type accumulators. Thus, after initializing some generated accumulator a, one could modify a.foo.bar.max2track or a.foo.bar.max2rep to change the tracking or reporting of the foo.bar component of a.
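The copy-defaults-at-init pattern described in this section can be modeled with a few stand-in structs. Everything prefixed with toy_ is hypothetical and only mirrors the field names used above; the real Pads types and init functions also take a P_t handle and carry much more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the accumulator-related fields of the discipline. */
typedef struct {
    uint64_t acc_max2track;
    uint64_t acc_max2rep;
    double   acc_pcnt2rep;
} toy_disc;

/* Stand-in for a base-type accumulator: it keeps its own copy of each limit. */
typedef struct {
    uint64_t max2track;
    uint64_t max2rep;
    double   pcnt2rep;
} toy_base_acc;

/* Stand-ins for generated accumulators built from base-type accumulators. */
typedef struct { toy_base_acc bar; } toy_foo_acc;
typedef struct { toy_base_acc nerr; toy_foo_acc foo; } toy_rec_acc;

/* Base-type init copies the discipline defaults into the accumulator. */
void toy_base_acc_init(const toy_disc *disc, toy_base_acc *a) {
    assert(disc != NULL && a != NULL);
    a->max2track = disc->acc_max2track;
    a->max2rep   = disc->acc_max2rep;
    a->pcnt2rep  = disc->acc_pcnt2rep;
}

/* Generated init simply initializes every component. */
void toy_rec_acc_init(const toy_disc *disc, toy_rec_acc *a) {
    assert(a != NULL);
    toy_base_acc_init(disc, &a->nerr);
    toy_base_acc_init(disc, &a->foo.bar);
}
```

In this model, assigning a.foo.bar.max2track = 50 after toy_rec_acc_init changes the limit for that one component only; a.nerr keeps the value copied from the discipline.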

16.3  Template Program

Because generating an accumulator report from a Pads description is a very routine task, Pads provides a template program to automate the task for common data formats. In particular, the template applies to data that can be viewed as an optional header followed by a sequence of records. Note that any data source that can be read entirely into memory fits this pattern by considering the source to have no header and a single body record.

When instantiated, the template program takes an optional command-line argument specifying the path to the data source. If no argument is given, it uses a default location for the data specified by the template user. The template first reads the optional header, then reads each record and inserts the resulting value into an accumulator until the data source is exhausted, at which point it prints the accumulator report to standard error. The code in Figure 2.7 illustrates using the accumulator template template/accum_report.h. This template is a C header file parameterized by a number of macros; the user customizes the template by defining appropriate values for these macros. For example, in the code in Figure 2.7, the user defines the macros DEF_INPUT_FILE, PADS_TY, and IO_DISC_MK to indicate the default input file, the type of the repeated record in the data source, and the IO discipline. The following list describes these and the other macros used by the accumulator template:

DATE_IN_FMT
If defined, this macro sets the default input format for dates described by Pdate. See Section 15.1.12 for more information.
DATE_OUT_FMT
If defined, this macro sets the default output format for Pdate and Pdate_explicit. See Section 15.1.13 for more information.
DEF_INPUT_FILE
If defined, this macro specifies a string representation of the path to the default data source. If no path to the data is supplied on the command line, this is the location used for input data.
EXTRA_BAD_READ_CODE
If defined, this macro points to a C statement that will be executed after any body record containing an error.
EXTRA_BEGIN_CODE
If defined, this macro points to a C statement that will be executed after all initialization code is performed, but before the optional header is read.
EXTRA_DECLS
This optional macro defines additional C declarations that precede all accumulator code.
EXTRA_DONE_CODE
If defined, this macro points to a C statement that will be executed after generating the accumulator report.
EXTRA_GOOD_READ_CODE
If defined, this macro points to a C statement that will be executed after any body record not containing an error.
EXTRA_HEADER_READ_ARGS
If the type of the header record was parameterized, this macro allows the user to supply corresponding parameters.
EXTRA_READ_ARGS
If the type of the repeated record was parameterized, this macro allows the user to supply corresponding parameters.
IN_TIME_ZONE
If set, this macro specifies the input time zone of date types that do not include time zone information. See Section 15.1.10 for more detail.
IO_DISC_MK
If defined, this macro specifies the interpretation of Precord by indicating which IO discipline the system should install. It specifies the discipline by naming the function to create the discipline. Section 15.2 describes the available IO discipline creation functions. If the user does not define this macro, the system installs the IO discipline corresponding to new-line terminated ASCII records.
MAX_RECS
If defined, this macro specifies an integer that limits the number of repeated records that the accumulator program should read.
OUT_TIME_ZONE
If set, this macro specifies the output time zone of date types. See Section 15.1.11 for more detail.
PADS_HDR_TY
Intuitively, this macro defines the type of the header record in the data source. This macro need only be defined if the data source has a header record. It defines a function used by the template program to generate the various function and type names derived from the name of the header record type, i.e., the type of the associated in-memory representation, mask, parse descriptor, read function, etc.
PADS_TY
Intuitively, this macro defines the type of the repeated record in the data source, i.e., the type of the value to be accumulated. This macro must be defined to use the accumulator template. It defines a function used by the template program to generate the various function and type names derived from the name of the record type, i.e., the type of the associated in-memory representation, mask, parse descriptor, read function, etc.
READ_MASK
This macro specifies the mask to use in reading the repeated record. If not defined by the user, the template uses the value P_CheckAndSet.
TIME_IN_FMT
If defined, this macro sets the default input format for Ptime. See Section 15.1.12 for more information.
TIME_OUT_FMT
If defined, this macro sets the default output format for Ptime and Ptime_explicit. See Section 15.1.13 for more information.
TIMESTAMP_IN_FMT
If defined, this macro sets the default input format for Ptimestamp. See Section 15.1.12 for more information.
TIMESTAMP_OUT_FMT
If defined, this macro sets the default output format for the Pads types Ptimestamp and Ptimestamp_explicit. See Section 15.1.13 for more information.
WSPACE_OK
If defined, this macro indicates that leading white space for variable-width ASCII integers is okay, as well as leading and trailing white space for fixed-width ASCII integers.
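The PADS_TY and MAX_RECS entries describe a macro-driven mechanism that can be sketched in standalone C. The sketch below is an assumption about how such a template could be written, not the contents of template/accum_report.h: the entry_t types and functions are simplified stand-ins for what a generated header would declare (the real ones also take a P_t handle), PADS_TY is assumed to be a function-like macro that pastes a suffix onto the record type name, and toy_template_run is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* --- Stand-ins for declarations a generated header would provide --- */
typedef struct { int length; } entry_t;            /* in-memory representation */
typedef struct { int nerr; } entry_t_pd;           /* parse descriptor */
typedef struct { long good; long bad; } entry_t_acc;

void entry_t_acc_init(entry_t_acc *acc) { acc->good = 0; acc->bad = 0; }

void entry_t_acc_add(entry_t_acc *acc, const entry_t_pd *pd, const entry_t *rep) {
    (void)rep;                                     /* values are ignored in this toy */
    if (pd->nerr > 0) acc->bad++; else acc->good++;
}

/* --- What the template user writes (assumed macro shapes) --- */
#define PADS_TY(suf) entry_t ## suf                /* derive names by token pasting */
#define MAX_RECS 3                                 /* read at most three records */

/* --- What a template could do with those macros (sketch only) --- */
size_t toy_template_run(const PADS_TY( ) *reps, const PADS_TY(_pd) *pds,
                        size_t nrecs, PADS_TY(_acc) *acc) {
    size_t i, nread = 0;
    assert(acc != NULL);
    PADS_TY(_acc_init)(acc);                       /* expands to entry_t_acc_init */
    for (i = 0; i < nrecs; i++) {
#ifdef MAX_RECS
        if (nread >= MAX_RECS) break;              /* optional record limit */
#endif
        PADS_TY(_acc_add)(acc, &pds[i], &reps[i]); /* expands to entry_t_acc_add */
        nread++;
    }
    return nread;
}
```

Because every type and function name is derived from PADS_TY, pointing the same template at a different record type would only require changing that one macro definition. The empty-argument form PADS_TY( ) relies on C99 or later.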
