Previous Up Next

Chapter 17  Histogram

A histogram is a piecewise-constant approximation of an observed data distribution. It is used as a small space, approximate synopsis of the underlying data distribution which are ofter too large to be stored precisely. Histograms are built for each meaningful piece of a Pads description. Figure 17.1 is an example report for the length field of a web server log data.


*** Histogram Result *** 
From 0 to 397, with height 4016. 
From 398 to 398, with height 36122. 
From 399 to 423, with height 5584. 
From 424 to 424, with height 33250. 
From 425 to 499, with height 3126. 
*** Histogram Result *** 
From 0 to 196, with height 3286. 
From 197 to 197, with height 30430. 
From 198 to 313, with height 3233. 
From 314 to 314, with height 30430. 
From 315 to 499, with height 3655. 
*** Histogram Result *** 
From 0 to 242, with height 3686. 
From 243 to 243, with height 36122. 
From 244 to 441, with height 3720. 
From 442 to 442, with height 26074. 
From 443 to 499, with height 7035. 
*** Histogram Result *** 
From 0 to 93, with height 4204. 
From 94 to 94, with height 36122. 
From 95 to 206, with height 3000. 
From 207 to 210, with height 21496. 
From 211 to 499, with height 3890. 
. . . . . . . . . . . . . . . . . . . . . . 
Figure 17.1: Portion of histogram report for length field of web server log data.

In this particular run, optimal 5-bucket histogram is built for every 500 values seen in the data source.

17.1  Operations

Figure 17.2 shows the histogram functions declared for a Pads type.


Perror_t entry_t_hist_init          (P_t *pads,entry_t_hist *h);
Perror_t entry_t_hist_setPara       (P_t *pads,entry_t_hist *h,P_hist *d_hist);
Perror_t entry_t_hist_reset         (P_t *pads,entry_t_hist *h);
Perror_t entry_t_hist_cleanup       (P_t *pads,entry_t_hist *h);
Perror_t entry_t_hist_add           (P_t *pads,entry_t_hist *h,Pbase_pd *pd,entry_t *rep,Puint32 *isFull);
Perror_t entry_t_hist_reportFull2io (P_t *pads,Sfio_t *outstr,const char *prefix,
                                     const char *what, int nst,entry_t_hist *h);
Perror_t entry_t_hist_reportAll2io  (P_t *pads,Sfio_t *outstr,const char *prefix,
                                     const char *what, int nst,entry_t_hist *h);
Perror_t entry_t_hist_reportFull    (P_t *pads,const char *prefix,const char *what,int nst,entry_t_hist *h);
Perror_t entry_t_hist_reportAll     (P_t *pads,const char *prefix,const char *what,int nst,entry_t_hist *h);
Figure 17.2: Histogram functions generated for the entry_t type.

These functions have the following behaviors:

entry_t_hist_init
Initializes histogram data structure. This function must be called before any data can be added to the histogram.
entry_t_hist_setPara
Customizes histogram data structure. For the two conversion functions (specified below), user needs to set the corresponding fields in the template program. This function must be called to make any customization effected.
entry_t_hist_reset
Reinitializes histogram data structure. This function can be used to set any point of the data source as the start point of a new run. But it can’t be used to reset any previous defined parameters.
entry_t_hist_cleanup
Deallocates all memory associated with histogram.
entry_t_hist_add
Inserts a data value. This function is called once a new record is coming. Any data type with an associated mapping function to Pfloat64 is considered as a meaningful type. This function tracks fields with meaningful type and legal values only. The output parameter isFull will be set nonzero, if the current data is the last one of a portion.
entry_t_hist_reportFull2io
Writes summary report for finished histograms to *outstr. In most cases, when this function is called, all stored histograms will be reported and the space will be released, while the current one won’t.
entry_t_hist_reportAll2io
Writes summary report for all histograms to *outstr. When this function is called, all histograms will be reported and the space will be released.
entry_t_hist_reportFull
Writes summary report for finished histograms to screen.
entry_t_hist_reportAll
Writes summary report for all histograms to screen.

Figure 17.3 illustrates a sample use of histogram functions for printing a summary of CLF entry_ts.


#include "wsl.h"
#define DEF_INPUT_FILE  
"data/wsl"

int main(int argc, char** argv) {
  P_t                  *pads;
  Pio_disc_t           *io_disc;
  P_hist               default_hist;
  entry_t              rep;
  entry_t_pd           pd;
  entry_t_m            mask;
  entry_t_hist         h;
  Puint32              isFull;
  
char                 *fname = DEF_INPUT_FILE;

  io_disc = P_nlrec_noseek_make(
0);
  P_open(&pads, 
0, io_disc);

  entry_t_init(pads, &rep);
  entry_t_pd_init(pads, &pd);
  entry_t_m_init(pads, &mask, P_CheckAndSet);

  
if (P_ERR == P_io_fopen(pads, fname)) {
    error(
2"*** P_io_fopen failed ***");
    
return -1;
  }

  entry_t_hist_init(pads, &h);
  default_hist.toFloat=0;
  default_hist.fromFloat=0;
  entry_t_hist_setPara(pads, h, default_hist);
  
while (!P_io_at_eof(pads)) {
    entry_t_read(pads, &mask, &pd, &rep);
    entry_t_hist_add(pads, &h, &pd, &rep, &isFull);
    
if (isFull != 0) entry_t_hist_reportFull(pads, ""00, &h);
  }
  entry_t_acc_reportAll(pads, 
""00, &h);

  P_io_close(pads);
  entry_t_cleanup(pads, &rep);
  entry_t_pd_cleanup(pads, &pd);
  entry_t_hist_cleanup(pads, &h);
  P_close(pads);
  
return 0;
}
Figure 17.3: Simple use of histogram functions for the entry_t type from CLF data.

17.2  Customization

Users are allowed to customize various aspects of histogram by setting the appropriate field in the histogram data structure, which contains:

INIT_N
is a Puint64 denoting the number of values for histogram to summarize. If the number of values in the data source exceeds INIT_N, histograms will be built on each INIT_N data values respectively, until the end of data source is reached.
INIT_B
is a Puint32 denoting the number of buckets in final histogram. As INIT_B increases, accuracy of final histogram approximation is increased, while more time and space is consumed. The default value is 10.
INIT_M
is a Pint64 denoting an upper bound of data values in data source. Time consumed increases in poly-logrithm of INIT_M, so INIT_M can be set very large if little about data values in data source is known.
INIT_ISE
is a Pint8 denoting whether buckets in the final histogram are required to be of the same width. If INIT_ISE is set to be non-zero, all buckets have equal width. In this case, the time needed is in linear-INIT_N, and only constant space will be used. However, the result histogram will have less accuracy.
INIT_ISO
is a Pint8 denoting whether final histogram is required to be optimal or not. This parameter is valid only when INIT_ISE is zero. If INIT_ISO is set to be non-zero, the result histogram will be the most accurate one among all INIT_B bucket histograms. However, the time needed is in cubic-INIT_N, which could be extremely slow, and the space needed is in linear-INIT_N, since all data values in each INIT_N section are required to be stored.
INIT_n
is a Pint8 denoting whether norm 1 or norm 2 is used to measure accuracy of final histogram. Currently, norm 1 measurement is supported only when all the data values are stored. In other words, it is supported only when INIT_ISE is zero, and INIT_ISO is non-zero.
INIT_e
is a Pfloat64 denoting error tolerance of the final histogram. This parameter is valid only when non-optimal result is allowed, namely both INIT_ISE and INIT_ISO are zeroes. The final histogram will be guaranteed to be no worse than poly- (1+INIT_e) times of the optimal one, but the time and space needed increase as the error tolerance decreases.
INIT_scale
is a Pint64 denoting scale factor for each data value. This parameter is important for computing, but will not affect the final result. For example, if the data source can take values up to 64 bits, the overall SSE could need as many as 128 bits, which exceeds the representation limit of Pads. In this case, INIT_scale is needed.
INIT_maxPortion
is a Pint8 denoting the maximal number of stored histograms. If user forgets to report a finished histogram, or the inner histogram is finished while the outer one is not in nested cases, the finished histogram will go into a stored histogram list. If a new histogram is required to build, while the number of stored histograms is already INIT_maxPortion, the list will be cleared and a warning will be reported.
entry_t_toFloat
is a function pointer, taking entry_t as input parameter, and returning corresponding Pfloat64. Histograms will handle Pfloat64 type data value only. Any type with a well-defined conversion function to Pfloat64 is considered as a meaningful type, and could be summarized correctly by histograms. By default, all base types other than Pstring in Pads have conversion functions to Pfloat64. Users are allowed to write their own conversion function for each field by defining macro EXTRA_INIT_CODE. If zero is assigned to this pointer, those default conversion functions will be used.
entry_t_fromFloat
is a function pointer, taking Pfloat64 as input parameter, and returning corresponding entry_t type. Any type without a well-defined conversion function from Pfloat64 may not be printed correctly. By default, all base types other than Pstring in Pads have conversion functions from Pfloat64. Users are allowed to write their own conversion function for each field by defining macro EXTRA_INIT_CODE. If zero is assigned to this pointer, those default conversion functions will be used.

17.3  Template Program

Because generating a histogram report from a Pads description is a very routine task, Pads provides a template program to automate the task for common data formats. In particular, the template applies to data that can be viewed as an optional header followed by a sequence of records. Note that any data source that can be read entirely into memory fits this pattern by considering the source to have no header and a single body record.

When instantiated, the template program takes an optional command-line argument specifying the path to the data source. If no argument is given, it uses a default location for the data specified by the template user. The template first reads the optional header, then reads each record and inserts the value of each meanful field into histogram until either the data source is exhuasted or the end of a portion is reached, at which point it prints the histogram report to standard io. The following list describes the macros used by histogram template:

DEF_INPUT_FILE
If defined, this macro specifies a string representation of the path to the default data source. If no path to the data is supplied at the command-line, this is the location used for input data.
EXTRA_BEGIN_CODE
If defined, this macro points to a C statement that will be executed after all initialization code is performed, but before the optional header is read.
EXTRA_DECLS
This optional macro defines additional C declarations that proceed all template code.
EXTRA_DONE_CODE
If defined, this macro points to a C statement that will be executed after generating the accumulator report.
EXTRA_INIT_CODE
This optional macro defines additional C codes that customize histogram data structure for different fields.
EXTRA_READ_ARGS
If the type of the repeated record was parameterized, this macro allows the user to supply corresponding parameters.
IO_DISC_MK
If defined, this macro specifies the interpretation of Precord by indicating which IO discpline the system should install. It specifies the discipline by naming the function to create the discipline. Section 15.2 describes the available IO discipline creation functions. If the user does not define this macro, the system installs the IO discipline corresponding to new-line terminated ASCII records.
PADS_HDR_TY
Intuitively, this macro defines the type of the header record in the data source. This macro need only be defined if the data source has a header record. It defines a function used by the template program to generate the various function and type names derived from the name of the header record type, i.e., the type of the associated in-memory representation, mask, parse descriptor, read function, etc.
PADS_TY
Intuitively, this macro defines the type of the repeated record in the data source, i.e., the type of the value to be summarized. This macro must be defined to use the histogram template. It defines a function used by the template program to generate the various function and type names derived from the name of the record type, i.e., the type of the associated in-memory representation, mask, parse descriptor, read function, etc.
READ_MASK
This macro specifies the mask to use in reading the repeated record. If not defined by the user, the template uses the value P_CheckAndSet.

Previous Up Next