A clustering program divides data into several groups according to a
chosen distribution and summarizes the data by recording specified
features of each group. A clustering is built for each meaningful
piece of a Pads description. Figure 18.1
shows an example report for web server log data.
[Describing each tag arm of <top>.host]
=====================================================================================================
<top>.host.resolved : array nIP of Puint8
=====================================================================================================
Array lengths:
Clustering based distribution: User defined distribution.
mean 4, and variance 0, containing 4 elements.
=====================================================================================================
Possible anormality based on probability 0.010000:
Possible anormality based on clustering elements number 0.100000:
-----------------------------------------------------------------------------------------------------
allArrayElts : uint8
-----------------------------------------------------------------------------------------------------
Clustering based distribution: User defined distribution.
mean 128, and variance 77, containing 8 elements.
mean 136, and variance 0, containing 4 elements.
mean 97, and variance 0, containing 4 elements.
=====================================================================================================
Possible anormality based on probability 0.010000:
Data (around): 49
Data (around): 207
Data (around): 49
Data (around): 207
Data (around): 50
Data (around): 207
Data (around): 50
Possible anormality based on clustering elements number 0.100000:
=====================================================================================================
<top>.host.symbolic : array sIP of Pstring_SE
=====================================================================================================
Array lengths:
Clustering based distribution: User defined distribution.
mean 4, and variance 0, containing 7 elements.
=====================================================================================================
Possible anormality based on probability 0.010000:
Possible anormality based on clustering elements number 0.100000:
-----------------------------------------------------------------------------------------------------
allArrayElts : string
-----------------------------------------------------------------------------------------------------
Clustering based distribution: User defined distribution.
mean non defined., and variance non defined., containing 28 elements.
=====================================================================================================
Possible anormality based on probability 0.010000:
Possible anormality based on clustering elements number 0.100000:
. . . . . . . . . . . . . . . . . . . . . .
Figure 18.1: Portion of clustering report for web server log data.
In this particular run, at most three clusterings are built for all
the data values seen in the data source.
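Each report line of the form "mean ..., and variance ..., containing
... elements" summarizes one clustering by three numbers. The sketch
below shows one way such a summary could be maintained as values
arrive. It is not Pads code: the cluster_summary type and its
functions are hypothetical, the update rule is Welford's method, and
the use of the population variance is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cluster summary; not the Pads data structure. */
typedef struct {
    size_t count;  /* number of elements assigned to the cluster */
    double mean;   /* running mean of those elements */
    double m2;     /* running sum of squared deviations from the mean */
} cluster_summary;

void summary_init(cluster_summary *s) {
    assert(s != NULL);
    s->count = 0;
    s->mean = 0.0;
    s->m2 = 0.0;
}

/* Welford's update: fold one value into the summary in constant space. */
void summary_add(cluster_summary *s, double x) {
    double delta;
    assert(s != NULL);
    delta = x - s->mean;
    s->count += 1;
    s->mean += delta / (double)s->count;
    s->m2 += delta * (x - s->mean);
}

/* Population variance (assumed); 0 for an empty cluster. */
double summary_variance(const cluster_summary *s) {
    assert(s != NULL);
    return s->count ? s->m2 / (double)s->count : 0.0;
}
```

Feeding the value 136 into such a summary four times gives a mean of
136, a variance of 0, and a count of 4, which has the same shape as
the second cluster line reported for allArrayElts in Figure 18.1.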
18.1 Operations
Figure 18.2 shows the clustering functions declared
for a Pads type.
Perror_t entry_t_cluster_init (P_t *pads,entry_t_cluster *h);
Perror_t entry_t_cluster_setPara (P_t *pads,entry_t_cluster *h,P_cluster *d_cluster);
Perror_t entry_t_cluster_reset (P_t *pads,entry_t_cluster *h);
Perror_t entry_t_cluster_cleanup (P_t *pads,entry_t_cluster *h);
Perror_t entry_t_cluster_add (P_t *pads,entry_t_cluster *h,Pbase_pd *pd,entry_t *rep,Puint32 *isFull);
Perror_t entry_t_cluster_report2io (P_t *pads,Sfio_t *outstr,const char *prefix,
const char *what, int nst,entry_t_cluster *h);
Perror_t entry_t_cluster_report (P_t *pads,const char *prefix,const char *what,int nst,entry_t_cluster *h);
Figure 18.2: Clustering functions generated for the entry_t type.
These functions have the following behaviors:
- entry_t_cluster_init
- Initializes the clustering data
structure. This function must be called before any data can be added
to the clustering.
- entry_t_cluster_setPara
- Customizes the clustering data
structure. For the distribution function and the two conversion
functions (specified below), the user must set the corresponding
fields explicitly. This function must be called for any
customization to take effect.
- entry_t_cluster_reset
- Reinitializes the clustering data
structure. This function can be used to make any point of the data
source the starting point of a new run. It does not
reset any previously defined parameters.
- entry_t_cluster_cleanup
- Deallocates all memory associated
with the clustering.
- entry_t_cluster_add
- Inserts a data value. This function
is called each time a new record arrives. Any data type with an
associated mapping function to Pfloat64 is considered a
meaningful type. This function tracks only fields that have a
meaningful type and legal values.
- entry_t_cluster_report2io
- Writes a summary report for
clustering h to *outstr.
- entry_t_cluster_report
- Writes a summary report for
clustering h to the screen.
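The notion of a "meaningful type" rests on a mapping to Pfloat64, and
printing a cluster mean back in the field's own type needs the
reverse mapping. As a rough illustration only, the following
standalone sketch shows such a pair for a made-up time-of-day type.
The names float64_sketch, clock_time, clock_time_toFloat and
clock_time_fromFloat are hypothetical, and the signatures are
assumptions rather than the ones Pads expects.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Pads' Pfloat64; an assumption that keeps the sketch standalone. */
typedef double float64_sketch;

/* Hypothetical field type: a time of day. */
typedef struct {
    unsigned hour;
    unsigned minute;
} clock_time;

/* Map a time of day to minutes since midnight. */
float64_sketch clock_time_toFloat(const clock_time *t) {
    assert(t != NULL);
    return (double)(t->hour * 60u + t->minute);
}

/* Map a float (for example a cluster mean) back to the nearest minute,
 * clamping to the range 00:00 through 23:59. */
clock_time clock_time_fromFloat(float64_sketch f) {
    clock_time t;
    unsigned total;
    if (f < 0.0) f = 0.0;
    if (f > 1439.0) f = 1439.0;
    total = (unsigned)(f + 0.5);
    t.hour = total / 60u;
    t.minute = total % 60u;
    return t;
}
```

With such a pair, a cluster whose mean is 614.6 minutes would be
printed as 10:15 rather than as a raw floating-point number.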
Figure 18.3 illustrates a sample use of the clustering
functions to print a summary of CLF entry_t values.
#include "wsl.h"
#define DEF_INPUT_FILE "data/wsl"

int main(int argc, char** argv) {
  P_t              *pads;
  Pio_disc_t       *io_disc;
  P_cluster         default_cluster;
  entry_t           rep;
  entry_t_pd        pd;
  entry_t_m         mask;
  entry_t_cluster   h;
  Puint32           isFull;
  char             *fname = DEF_INPUT_FILE;

  /* Open a Pads handle that reads new-line terminated records. */
  io_disc = P_nlrec_noseek_make(0);
  P_open(&pads, 0, io_disc);

  /* Initialize the representation, parse descriptor, and mask. */
  entry_t_init(pads, &rep);
  entry_t_pd_init(pads, &pd);
  entry_t_m_init(pads, &mask, P_CheckAndSet);

  if (P_ERR == P_io_fopen(pads, fname)) {
    error(2, "*** P_io_fopen failed ***");
    return -1;
  }

  /* Initialize the clustering, then install the parameters.
     A zero pointer selects the default function for that field. */
  entry_t_cluster_init(pads, &h);
  default_cluster.toFloat = 0;
  default_cluster.fromFloat = 0;
  default_cluster.Distri_fn = 0;
  entry_t_cluster_setPara(pads, &h, &default_cluster);

  /* Read each record and add it to the clustering. */
  while (!P_io_at_eof(pads)) {
    entry_t_read(pads, &mask, &pd, &rep);
    entry_t_cluster_add(pads, &h, &pd, &rep, &isFull);
  }

  /* Print the summary, then release all resources. */
  entry_t_cluster_report(pads, "", 0, 0, &h);
  P_io_close(pads);
  entry_t_cleanup(pads, &rep);
  entry_t_pd_cleanup(pads, &pd);
  entry_t_cluster_cleanup(pads, &h);
  P_close(pads);
  return 0;
}
Figure 18.3: Simple use of clustering functions for the
entry_t type from CLF data.
18.2 Customization
Users can customize various aspects of clustering by
setting the appropriate fields in the clustering data structure,
which contains:
- INIT_CTYPE
- is an enumeration denoting the type of the
underlying distribution for each clustering. Built-in distributions
include K_mean, the Gaussian distribution, the exponential
distribution and the Laplace distribution. Users may add any
distribution that is fully characterized by its mean and variance by
setting the field Distri_fn, which is described below.
- INIT_K
- is a Puint32 denoting the maximal number of
clusterings into which the data source may be divided. Together
with INIT_CTYPE, it determines the underlying model of the data
source.
- INIT_OPEN
- is a Pfloat64 denoting the probability
threshold for opening a new clustering. A new clustering is
opened for an incoming data value if and only if the number of current
clusterings is less than INIT_K and the probability that the value
falls in each current clustering is less than INIT_OPEN.
- INIT_INITVAR
- is a Pfloat64 denoting the initial
variance for each clustering. It takes effect only before the second
data item is inserted. After that, the variance of each clustering
will be fully decided by its elements.
- INIT_ANORM_POS
- is a Pfloat64 denoting the
probability threshold for detecting anomalies (printed as
"anormality" in the report). A data value is reported as an anomaly
if no new clustering is opened for it and the probability that it
falls in each existing clustering is less than
INIT_ANORM_POS. Anomalies detected later in the data source
are expected to be more accurate than those detected near the
beginning of the data source.
- INIT_ANORM_NUM
- is a Pfloat64 denoting the element-count
threshold for detecting anomalies. A whole clustering is
reported as an anomaly if the number of its elements is less than the
fraction INIT_ANORM_NUM of the total number of data items in the data
source.
- entry_t_probFn
- is a function pointer that takes a mean, a
variance and a data value as input and returns the
probability of that data value under the given mean and variance. Users
can define their own distribution for each clustering, as long as the
distribution is fully specified by its mean and variance. To do so,
they must first set INIT_CTYPE to OTHERS, then use
EXTRA_INIT_CODE to define their own distribution function and
assign it to this pointer. If INIT_CTYPE is set to OTHERS
and this pointer is zero, the Gaussian distribution is used.
- entry_t_toFloat
- is a function pointer that takes an
entry_t as its input parameter and returns the corresponding
Pfloat64. Clusterings handle data values of type Pfloat64
only. Any type with a well-defined conversion function to
Pfloat64 is considered a meaningful type and can be summarized
correctly by clusterings. By default, all Pads base types other
than Pstring have conversion functions to Pfloat64. Users may write
their own conversion function for each field by defining the macro
EXTRA_INIT_CODE. If zero is assigned to this pointer, the default
conversion functions are used.
- entry_t_fromFloat
- is a function pointer that takes a
Pfloat64 as its input parameter and returns the corresponding
entry_t value. Any type without a well-defined conversion
function from Pfloat64 may not be printed correctly. By
default, all Pads base types other than Pstring have
conversion functions from Pfloat64. Users may write their
own conversion function for each field by defining the macro
EXTRA_INIT_CODE. If zero is assigned to this pointer, the default
conversion functions are used.
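The interaction of the thresholds above can be sketched as a small
standalone program fragment. This is an illustration under stated
assumptions, not the Pads implementation: the types and functions are
hypothetical, the macro values are examples (only 3, 0.01 and 0.1
appear in Figure 18.1), a value is assumed to join the
highest-scoring clustering even when it is flagged, and the scoring
function is a Chebyshev-style tail bound chosen because it needs only
a mean and a variance, rather than any of the built-in distributions.

```c
#include <assert.h>
#include <stddef.h>

/* Example values; the names mirror the fields described above. */
#define INIT_K         3     /* maximal number of clusterings */
#define INIT_OPEN      0.05  /* open a new clustering below this score */
#define INIT_INITVAR   1.0   /* variance used before the second element */
#define INIT_ANORM_POS 0.01  /* flag a value below this score */
#define INIT_ANORM_NUM 0.1   /* flag a clustering below this fraction */

/* Assumed shape of a user-defined distribution function. */
typedef double (*prob_fn)(double mean, double variance, double x);

typedef struct { size_t count; double mean; double m2; } cluster;
typedef struct { cluster clusters[INIT_K]; size_t used; prob_fn prob; } clustering;

/* Chebyshev-style score in [0,1]: bounds P(|X - mean| >= |x - mean|). */
double chebyshev_score(double mean, double variance, double x) {
    double d2 = (x - mean) * (x - mean);
    if (d2 <= variance) return 1.0;  /* also covers x == mean */
    return variance / d2;
}

/* INIT_INITVAR applies only until the second element has been inserted. */
static double cluster_variance(const cluster *c) {
    return c->count < 2 ? INIT_INITVAR : c->m2 / (double)c->count;
}

void clustering_init(clustering *h, prob_fn prob) {
    assert(h != NULL);
    h->used = 0;
    h->prob = prob ? prob : chebyshev_score;  /* zero selects the default */
}

/* Adds x; returns 1 if x is flagged as a possible anomaly, else 0. */
int clustering_add(clustering *h, double x) {
    size_t i, best = 0;
    double best_p = -1.0, delta;
    int anomaly = 0;
    cluster *c;
    assert(h != NULL);
    for (i = 0; i < h->used; i++) {
        double p = h->prob(h->clusters[i].mean,
                           cluster_variance(&h->clusters[i]), x);
        if (p > best_p) { best_p = p; best = i; }
    }
    if (h->used < INIT_K && best_p < INIT_OPEN) {
        best = h->used++;                 /* open a new clustering */
        h->clusters[best].count = 0;
        h->clusters[best].mean = 0.0;
        h->clusters[best].m2 = 0.0;
    } else if (best_p < INIT_ANORM_POS) {
        anomaly = 1;                      /* no room and a poor fit */
    }
    c = &h->clusters[best];               /* Welford update of the winner */
    delta = x - c->mean;
    c->count += 1;
    c->mean += delta / (double)c->count;
    c->m2 += delta * (x - c->mean);
    return anomaly;
}

/* Returns 1 if clustering i holds less than INIT_ANORM_NUM of all items. */
int cluster_is_sparse(const clustering *h, size_t i) {
    size_t j, total = 0;
    assert(h != NULL && i < h->used);
    for (j = 0; j < h->used; j++) total += h->clusters[j].count;
    return (double)h->clusters[i].count < INIT_ANORM_NUM * (double)total;
}
```

Adding the values 136, 136, 97, 97, 128, 128 in that order opens
three clusterings under these settings. A subsequent 49 fits none of
them and cannot open a fourth, so clustering_add returns 1 for it.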
18.3 Template Program
Because generating a clustering report from a Pads description is a
very routine task, Pads provides a template program to automate the
task for common data formats. In particular, the template applies to
data that can be viewed as an optional header followed by a sequence
of records. Note that any data source that can be read entirely into
memory fits this pattern by considering the source to have no header
and a single body record.
When instantiated, the template program takes an optional command-line
argument specifying the path to the data source. If no argument is
given, it uses a default location for the data specified by the
template user. The template first reads the optional header, then
reads each record and inserts the value of each meaningful field into
the clustering until either the data source is exhausted or the end of
a portion is reached, at which point it prints the clustering report to
standard io. The following list describes the macros used by
the clustering template:
- DEF_INPUT_FILE
- If defined, this macro specifies a string
representation of the path to the default data source. If no path to
the data is supplied at the command-line, this is the location used
for input data.
- EXTRA_BEGIN_CODE
- If defined, this macro points to a C
statement that will be executed after all initialization code is
performed, but before the optional header is read.
- EXTRA_DECLS
- This optional macro defines additional C
declarations that precede all template code.
- EXTRA_DONE_CODE
- If defined, this macro points to a C
statement that will be executed after generating the clustering report.
- EXTRA_INIT_CODE
- This optional macro defines additional C
code that customizes the clustering data structure for different fields.
- EXTRA_READ_ARGS
- If the type of the repeated record is
parameterized, this macro allows the user to supply the corresponding
parameters.
- IO_DISC_MK
- If defined, this macro specifies the
interpretation of Precord by indicating which IO discipline the
system should install. It specifies the discipline by naming the
function that creates the discipline. Section 15.2
describes the available IO discipline creation functions. If the
user does not define this macro, the system installs the IO
discipline corresponding to new-line terminated ASCII records.
- PADS_HDR_TY
- Intuitively, this macro defines the type of
the header record in the data source. This macro need only be
defined if the data source has a header record. It defines a function used by the template program to
generate the various function and type names derived from the name
of the header record type, i.e., the type of the associated in-memory
representation, mask, parse descriptor, read function, etc.
- PADS_TY
- Intuitively, this macro defines the type of the repeated
record in the data source, i.e., the type of the value to be
summarized. This macro must be defined to use the clustering
template. It defines a function used by the template program to
generate the various function and type names derived from the name
of the record type, i.e., the type of the associated in-memory
representation, mask, parse descriptor, read function, etc.
- READ_MASK
- This macro specifies the mask to use in reading
the repeated record. If not defined by the user, the template uses
the value P_CheckAndSet.
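The way PADS_TY lets the template derive type and function names from
a single record type name can be illustrated with token pasting. The
sketch below is standalone and makes several assumptions: the
definition of PADS_TY as a suffix-pasting macro is an assumed form,
the entry_t types and the two-argument entry_t_read are simplified
stand-ins for what a Pads compiler would generate, and
template_read_one is a hypothetical piece of template-side code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for generated names (hypothetical bodies). */
typedef struct { int status; } entry_t;
typedef struct { int nerr; } entry_t_pd;

int entry_t_read(entry_t_pd *pd, entry_t *rep) {
    assert(pd != NULL && rep != NULL);
    pd->nerr = 0;       /* pretend the record parsed without errors */
    rep->status = 200;  /* pretend value read from the data source */
    return 0;
}

/* Assumed form of the user's definition: paste a suffix onto the
 * record type name. */
#define PADS_TY(suf) entry_t ## suf

/* Template-side code written once against PADS_TY:
 * PADS_TY() expands to entry_t, PADS_TY(_pd) to entry_t_pd,
 * and PADS_TY(_read) to entry_t_read. */
int template_read_one(PADS_TY() *rep) {
    PADS_TY(_pd) pd;
    return PADS_TY(_read)(&pd, rep);
}
```

Changing the single definition of PADS_TY to name another record type
would redirect every derived name in the template-side code at once,
which is the effect the macro descriptions above rely on.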