A histogram is a piecewise-constant approximation of an observed
data distribution. It is used as a small space, approximate
synopsis of the underlying data distribution which are ofter too
large to be stored precisely. Histograms are built for each meaningful
piece of a Pads description. Figure 17.1
is an example report for the length field of a web server log data.
*** Histogram Result ***
From 0 to 397, with height 4016.
From 398 to 398, with height 36122.
From 399 to 423, with height 5584.
From 424 to 424, with height 33250.
From 425 to 499, with height 3126.
*** Histogram Result ***
From 0 to 196, with height 3286.
From 197 to 197, with height 30430.
From 198 to 313, with height 3233.
From 314 to 314, with height 30430.
From 315 to 499, with height 3655.
*** Histogram Result ***
From 0 to 242, with height 3686.
From 243 to 243, with height 36122.
From 244 to 441, with height 3720.
From 442 to 442, with height 26074.
From 443 to 499, with height 7035.
*** Histogram Result ***
From 0 to 93, with height 4204.
From 94 to 94, with height 36122.
From 95 to 206, with height 3000.
From 207 to 210, with height 21496.
From 211 to 499, with height 3890.
. . . . . . . . . . . . . . . . . . . . . .
Figure 17.1: Portion of histogram report for length field of web server
log data. |
In this particular run, optimal 5-bucket histogram is built for
every 500 values seen in the data source.
17.1 Operations
Figure 17.2 shows the histogram functions declared
for a Pads type.
Perror_t entry_t_hist_init (P_t *pads,entry_t_hist *h);
Perror_t entry_t_hist_setPara (P_t *pads,entry_t_hist *h,P_hist *d_hist);
Perror_t entry_t_hist_reset (P_t *pads,entry_t_hist *h);
Perror_t entry_t_hist_cleanup (P_t *pads,entry_t_hist *h);
Perror_t entry_t_hist_add (P_t *pads,entry_t_hist *h,Pbase_pd *pd,entry_t *rep,Puint32 *isFull);
Perror_t entry_t_hist_reportFull2io (P_t *pads,Sfio_t *outstr,const char *prefix,
const char *what, int nst,entry_t_hist *h);
Perror_t entry_t_hist_reportAll2io (P_t *pads,Sfio_t *outstr,const char *prefix,
const char *what, int nst,entry_t_hist *h);
Perror_t entry_t_hist_reportFull (P_t *pads,const char *prefix,const char *what,int nst,entry_t_hist *h);
Perror_t entry_t_hist_reportAll (P_t *pads,const char *prefix,const char *what,int nst,entry_t_hist *h);
Figure 17.2: Histogram functions generated for the entry_t type. |
These functions have the following behaviors:
-
entry_t_hist_init
- Initializes histogram data
structure. This function must be called before any data can be added
to the histogram.
- entry_t_hist_setPara
- Customizes histogram data
structure. For the two conversion functions (specified below), user
needs to set the corresponding fields in the template program. This
function must be called to make any customization effected.
- entry_t_hist_reset
- Reinitializes histogram data
structure. This function can be used to set any point of the data
source as the start point of a new run. But it can’t be used to
reset any previous defined parameters.
- entry_t_hist_cleanup
- Deallocates all memory associated
with histogram.
- entry_t_hist_add
- Inserts a data value. This function
is called once a new record is coming. Any data type with an
associated mapping function to Pfloat64 is considered as a
meaningful type. This function tracks fields with meaningful type
and legal values only. The output parameter isFull will be set
nonzero, if the current data is the last one of a portion.
- entry_t_hist_reportFull2io
- Writes summary report for
finished histograms to *outstr. In most cases, when this
function is called, all stored histograms will be reported and the
space will be released, while the current one won’t.
- entry_t_hist_reportAll2io
- Writes summary report for
all histograms to *outstr. When this function is called, all
histograms will be reported and the space will be released.
- entry_t_hist_reportFull
- Writes summary report for
finished histograms to screen.
- entry_t_hist_reportAll
- Writes summary report for
all histograms to screen.
Figure 17.3 illustrates a sample use of histogram
functions for printing a summary of CLF entry_ts.
#include "wsl.h"
#define DEF_INPUT_FILE "data/wsl"
int main(int argc, char** argv) {
P_t *pads;
Pio_disc_t *io_disc;
P_hist default_hist;
entry_t rep;
entry_t_pd pd;
entry_t_m mask;
entry_t_hist h;
Puint32 isFull;
char *fname = DEF_INPUT_FILE;
io_disc = P_nlrec_noseek_make(0);
P_open(&pads, 0, io_disc);
entry_t_init(pads, &rep);
entry_t_pd_init(pads, &pd);
entry_t_m_init(pads, &mask, P_CheckAndSet);
if (P_ERR == P_io_fopen(pads, fname)) {
error(2, "*** P_io_fopen failed ***");
return -1;
}
entry_t_hist_init(pads, &h);
default_hist.toFloat=0;
default_hist.fromFloat=0;
entry_t_hist_setPara(pads, h, default_hist);
while (!P_io_at_eof(pads)) {
entry_t_read(pads, &mask, &pd, &rep);
entry_t_hist_add(pads, &h, &pd, &rep, &isFull);
if (isFull != 0) entry_t_hist_reportFull(pads, "", 0, 0, &h);
}
entry_t_acc_reportAll(pads, "", 0, 0, &h);
P_io_close(pads);
entry_t_cleanup(pads, &rep);
entry_t_pd_cleanup(pads, &pd);
entry_t_hist_cleanup(pads, &h);
P_close(pads);
return 0;
}
Figure 17.3: Simple use of histogram functions for the
entry_t type from CLF data. |
17.2 Customization
Users are allowed to customize various aspects of histogram by
setting the appropriate field in the histogram data structure,
which contains:
-
INIT_N
- is a Puint64 denoting the number of values
for histogram to summarize. If the number of values in the data
source exceeds INIT_N, histograms will be built on each
INIT_N data values respectively, until the end of data
source is reached.
- INIT_B
- is a Puint32 denoting the number of buckets
in final histogram. As INIT_B increases, accuracy of final
histogram approximation is increased, while more time and space is
consumed. The default value is 10.
- INIT_M
- is a Pint64 denoting an upper bound of data
values in data source. Time consumed increases in poly-logrithm of
INIT_M, so INIT_M can be set very large if little about
data values in data source is known.
- INIT_ISE
- is a Pint8 denoting whether buckets in the final
histogram are required to be of the same width. If INIT_ISE is
set to be non-zero, all buckets have equal width. In this case,
the time needed is in linear-INIT_N, and only constant space will
be used. However, the result histogram will have less accuracy.
- INIT_ISO
- is a Pint8 denoting whether final histogram
is required to be optimal or not. This parameter is valid only
when INIT_ISE is zero. If INIT_ISO is set to be
non-zero, the result histogram will be the most accurate one among
all INIT_B bucket histograms. However, the time needed is in
cubic-INIT_N, which could be extremely slow, and the space
needed is in linear-INIT_N, since all data values in each
INIT_N section are required to be stored.
- INIT_n
- is a Pint8 denoting whether norm 1 or norm 2
is used to measure accuracy of final histogram. Currently, norm 1
measurement is supported only when all the data values are
stored. In other words, it is supported only when INIT_ISE is
zero, and INIT_ISO is non-zero.
- INIT_e
- is a Pfloat64 denoting error tolerance of the
final histogram. This parameter is valid only when non-optimal
result is allowed, namely both INIT_ISE and INIT_ISO are
zeroes. The final histogram will be guaranteed to be no worse than
poly- (1+INIT_e) times of the optimal one, but the time and
space needed increase as the error tolerance decreases.
- INIT_scale
- is a Pint64 denoting scale factor for
each data value. This parameter is important for computing, but will
not affect the final result. For example, if the data source can
take values up to 64 bits, the overall SSE could need
as many as 128 bits, which exceeds the representation limit of
Pads. In this case, INIT_scale is needed.
- INIT_maxPortion
- is a Pint8 denoting the maximal
number of stored histograms. If user forgets to report a finished
histogram, or the inner histogram is finished while the outer one is
not in nested cases, the finished histogram will go into a stored
histogram list. If a new histogram is required to build, while the
number of stored histograms is already INIT_maxPortion, the
list will be cleared and a warning will be reported.
- entry_t_toFloat
- is a function pointer, taking
entry_t as input parameter, and returning corresponding
Pfloat64. Histograms will handle Pfloat64 type data value
only. Any type with a well-defined conversion function to
Pfloat64 is considered as a meaningful type, and could be summarized
correctly by histograms. By default, all base types other than Pstring in Pads have
conversion functions to Pfloat64. Users are allowed to write their
own conversion function for each field by defining macro EXTRA_INIT_CODE. If zero is assigned to this pointer, those default
conversion functions will be used.
- entry_t_fromFloat
- is a function pointer, taking
Pfloat64 as input parameter, and returning corresponding
entry_t type. Any type without a well-defined conversion
function from Pfloat64 may not be printed correctly. By
default, all base types other than Pstring in Pads have
conversion functions from Pfloat64. Users are allowed to write their
own conversion function for each field by defining macro EXTRA_INIT_CODE. If zero is assigned to this pointer, those default
conversion functions will be used.
17.3 Template Program
Because generating a histogram report from a Pads description is a
very routine task, Pads provides a template program to automate the
task for common data formats. In particular, the template applies to
data that can be viewed as an optional header followed by a sequence
of records. Note that any data source that can be read entirely into
memory fits this pattern by considering the source to have no header
and a single body record.
When instantiated, the template program takes an optional command-line
argument specifying the path to the data source. If no argument is
given, it uses a default location for the data specified by the
template user. The template first reads the optional header, then
reads each record and inserts the value of each meanful field into
histogram until either the data source is exhuasted or the end of a
portion is reached, at which point it prints the histogram report to
standard io. The following list describes the macros used by
histogram template:
- DEF_INPUT_FILE
- If defined, this macro specifies a string
representation of the path to the default data source. If no path to
the data is supplied at the command-line, this is the location used
for input data.
- EXTRA_BEGIN_CODE
- If defined, this macro points to a C
statement that will be executed after all initialization code is
performed, but before the optional header is read.
- EXTRA_DECLS
- This optional macro defines additional C
declarations that proceed all template code.
- EXTRA_DONE_CODE
- If defined, this macro points to a C
statement that will be executed after generating the accumulator report.
- EXTRA_INIT_CODE
- This optional macro defines additional C
codes that customize histogram data structure for different fields.
- EXTRA_READ_ARGS
- If the type of the repeated record was
parameterized, this macro allows the user to supply corresponding
parameters.
- IO_DISC_MK
- If defined, this macro specifies the
interpretation of Precord by indicating which IO discpline the
system should install. It specifies the discipline by naming the
function to create the discipline. Section 15.2
describes the available IO discipline creation functions. If the
user does not define this macro, the system installs the IO
discipline corresponding to new-line terminated ASCII records.
- PADS_HDR_TY
- Intuitively, this macro defines the type of
the header record in the data source. This macro need only be
defined if the data source has a header record. It defines a function used by the template program to
generate the various function and type names derived from the name
of the header record type, i.e., the type of the associated in-memory
representation, mask, parse descriptor, read function, etc.
- PADS_TY
- Intuitively, this macro defines the type of the repeated
record in the data source, i.e., the type of the value to be
summarized. This macro must be defined to use the histogram
template. It defines a function used by the template program to
generate the various function and type names derived from the name
of the record type, i.e., the type of the associated in-memory
representation, mask, parse descriptor, read function, etc.
- READ_MASK
- This macro specifies the mask to use in reading
the repeated record. If not defined by the user, the template uses
the value P_CheckAndSet.