Chapter 18 Cluster

The clustering program divides data into several groups according to a chosen distribution and summarizes the data by recording specified features of each group. A clustering is built for each meaningful piece of a Pads description. Figure 18.1 shows an example report for web server log data.

[Describing each tag arm of <top>.host]

=====================================================================================================
<top>.host.resolved : array nIP of Puint8
=====================================================================================================
Array lengths: 
Clustering based distribution: User defined distribution. 
mean 4, and variance 0, containing 4 elements. 
=====================================================================================================
Possible anormality based on probability 0.010000: 
Possible anormality based on clustering elements number 0.100000: 

-----------------------------------------------------------------------------------------------------
allArrayElts : uint8
-----------------------------------------------------------------------------------------------------

Clustering based distribution: User defined distribution. 
mean 128, and variance 77, containing 8 elements. 
mean 136, and variance 0, containing 4 elements. 
mean 97, and variance 0, containing 4 elements. 
=====================================================================================================
Possible anormality based on probability 0.010000: 
Data (around): 49 
Data (around): 207 
Data (around): 49 
Data (around): 207 
Data (around): 50 
Data (around): 207 
Data (around): 50 
Possible anormality based on clustering elements number 0.100000: 

=====================================================================================================
<top>.host.symbolic : array sIP of Pstring_SE
=====================================================================================================
Array lengths: 
Clustering based distribution: User defined distribution. 
mean 4, and variance 0, containing 7 elements. 
=====================================================================================================
Possible anormality based on probability 0.010000: 
Possible anormality based on clustering elements number 0.100000: 

-----------------------------------------------------------------------------------------------------
allArrayElts : string
-----------------------------------------------------------------------------------------------------

Clustering based distribution: User defined distribution. 
mean non defined., and variance non defined., containing 28 elements. 
=====================================================================================================
Possible anormality based on probability 0.010000: 
Possible anormality based on clustering elements number 0.100000: 
. . . . . . . . . . . . . . . . . . . . . .

Figure 18.1: Portion of clustering report for web server log data.

In this particular run, at most three clusterings are built for the data values seen in the data source.

18.1 Operations

Figure 18.2 shows the clustering functions declared for a Pads type.

Perror_t entry_t_cluster_init      (P_t *pads, entry_t_cluster *h);
Perror_t entry_t_cluster_setPara   (P_t *pads, entry_t_cluster *h, P_cluster *d_cluster);
Perror_t entry_t_cluster_reset     (P_t *pads, entry_t_cluster *h);
Perror_t entry_t_cluster_cleanup   (P_t *pads, entry_t_cluster *h);
Perror_t entry_t_cluster_add       (P_t *pads, entry_t_cluster *h, Pbase_pd *pd, entry_t *rep, Puint32 *isFull);
Perror_t entry_t_cluster_report2io (P_t *pads, Sfio_t *outstr, const char *prefix, const char *what, int nst, entry_t_cluster *h);
Perror_t entry_t_cluster_report    (P_t *pads, const char *prefix, const char *what, int nst, entry_t_cluster *h);

Figure 18.2: Clustering functions generated for the entry_t type.

These functions have the following behaviors:

entry_t_cluster_init: Initializes the clustering data structure. This function must be called before any data can be added to the clustering.
entry_t_cluster_setPara: Customizes the clustering data structure. The user must set the fields for the distribution function and the two conversion functions (specified below) explicitly. This function must be called for any customization to take effect.
entry_t_cluster_reset: Reinitializes the clustering data structure. This function can be used to make any point of the data source the starting point of a new run. It does not reset previously defined parameters.
entry_t_cluster_cleanup: Deallocates all memory associated with the clustering.
entry_t_cluster_add: Inserts a data value. Call this function each time a new record is read. Any data type with an associated mapping function to Pfloat64 is considered a meaningful type. This function tracks only fields that have a meaningful type and a legal value.
entry_t_cluster_report2io: Writes a summary report for clustering h to outstr.
entry_t_cluster_report: Writes a summary report for clustering h to the screen.

Figure 18.3 illustrates a sample use of the clustering functions to print a summary of the CLF entry_t type.

#include "wsl.h"
#define DEF_INPUT_FILE "data/wsl"

int main(int argc, char** argv) {
  P_t             *pads;
  Pio_disc_t      *io_disc;
  P_cluster        default_cluster;
  entry_t          rep;
  entry_t_pd       pd;
  entry_t_m        mask;
  entry_t_cluster  h;
  Puint32          isFull;
  char            *fname = DEF_INPUT_FILE;

  /* Open the Pads handle with a newline-terminated record discipline. */
  io_disc = P_nlrec_noseek_make(0);
  P_open(&pads, 0, io_disc);

  /* Initialize the representation, parse descriptor, and mask. */
  entry_t_init(pads, &rep);
  entry_t_pd_init(pads, &pd);
  entry_t_m_init(pads, &mask, P_CheckAndSet);

  if (P_ERR == P_io_fopen(pads, fname)) {
    error(2, "*** P_io_fopen failed ***");
    return -1;
  }

  /* Initialize the clustering; zero pointers select the defaults. */
  entry_t_cluster_init(pads, &h);
  default_cluster.toFloat   = 0;
  default_cluster.fromFloat = 0;
  default_cluster.Distri_fn = 0;
  entry_t_cluster_setPara(pads, &h, &default_cluster);

  /* Read each record and add it to the clustering. */
  while (!P_io_at_eof(pads)) {
    entry_t_read(pads, &mask, &pd, &rep);
    entry_t_cluster_add(pads, &h, &pd, &rep, &isFull);
  }

  entry_t_cluster_report(pads, "", 0, 0, &h);

  P_io_close(pads);
  entry_t_cleanup(pads, &rep);
  entry_t_pd_cleanup(pads, &pd);
  entry_t_cluster_cleanup(pads, &h);
  P_close(pads);
  return 0;
}

Figure 18.3: Simple use of clustering functions for the entry_t type from CLF data.

18.2 Customization

Users are allowed to customize various aspects of clustering by setting the appropriate field in the clustering data structure, which contains:

INIT_CTYPE: An enumeration denoting the type of the underlying distribution for each clustering. Built-in distributions include K_mean, Gaussian, Exponential, and Laplace. Users may add any distribution that is fully characterized by its mean and variance by setting the field Distri_fn, which is described below.
INIT_K: A Puint32 denoting the maximum number of clusterings used to divide the data source. Together with INIT_CTYPE, it determines the underlying model of the data source.
INIT_OPEN: A Pfloat64 denoting the probability threshold for opening a new clustering. A new clustering is opened for an incoming data value if and only if the number of current clusterings is less than INIT_K and the probability of the value under every current clustering is less than INIT_OPEN.
INIT_INITVAR: A Pfloat64 denoting the initial variance of each clustering. It takes effect only until the second data item is inserted; after that, the variance of each clustering is determined entirely by its elements.
INIT_ANORM_POS: A Pfloat64 denoting the probability threshold for detecting anomalies. A data value is reported as an anomaly if no new clustering is opened for it and its probability under every existing clustering is less than INIT_ANORM_POS. Detections made later in the data source are expected to be more accurate than those made near its beginning.
INIT_ANORM_NUM: A Pfloat64 denoting the element-count threshold for detecting anomalies. A whole clustering is reported as an anomaly if its number of elements is less than the fraction INIT_ANORM_NUM of the total number of data items in the data source.
entry_t_probFn: A function pointer that takes a mean, a variance, and a data value as input and returns the probability of that data value under the given mean and variance. Users can define their own distribution for each clustering, as long as the distribution is fully specified by its mean and variance. To do so, first set INIT_CTYPE to OTHERS, then use EXTRA_INIT_CODE to define the distribution function and assign it to this pointer. If INIT_CTYPE is set to OTHERS and this pointer is zero, the Gaussian distribution is used.
entry_t_toFloat: A function pointer that takes an entry_t as input and returns the corresponding Pfloat64. Clusterings handle Pfloat64 data values only. Any type with a well-defined conversion function to Pfloat64 is considered a meaningful type and can be summarized correctly by clusterings. By default, all Pads base types other than Pstring have conversion functions to Pfloat64. Users can write their own conversion function for each field by defining the macro EXTRA_INIT_CODE. If this pointer is zero, the default conversion functions are used.
entry_t_fromFloat: A function pointer that takes a Pfloat64 as input and returns the corresponding entry_t value. Any type without a well-defined conversion function from Pfloat64 may not be printed correctly. By default, all Pads base types other than Pstring have conversion functions from Pfloat64. Users can write their own conversion function for each field by defining the macro EXTRA_INIT_CODE. If this pointer is zero, the default conversion functions are used.

18.3 Template Program

Because generating a clustering report from a Pads description is a very routine task, Pads provides a template program to automate the task for common data formats. In particular, the template applies to data that can be viewed as an optional header followed by a sequence of records. Note that any data source that can be read entirely into memory fits this pattern by considering the source to have no header and a single body record.

When instantiated, the template program takes an optional command-line argument specifying the path to the data source. If no argument is given, it uses a default location for the data specified by the template user. The template first reads the optional header, then reads each record and inserts the value of each meaningful field into the clustering until either the data source is exhausted or the end of a portion is reached, at which point it prints the clustering report to standard output. The following list describes the macros used by the clustering template:

DEF_INPUT_FILE: If defined, this macro specifies a string representation of the path to the default data source. If no path to the data is supplied at the command-line, this is the location used for input data.
EXTRA_BEGIN_CODE: If defined, this macro points to a C statement that will be executed after all initialization code is performed, but before the optional header is read.
EXTRA_DECLS: This optional macro defines additional C declarations that precede all template code.
EXTRA_DONE_CODE: If defined, this macro points to a C statement that will be executed after generating the clustering report.
EXTRA_INIT_CODE: This optional macro defines additional C code that customizes the clustering data structure for different fields.
EXTRA_READ_ARGS: If the type of the repeated record is parameterized, this macro allows the user to supply the corresponding parameters.
IO_DISC_MK: If defined, this macro specifies the interpretation of Precord by indicating which IO discipline the system should install. It specifies the discipline by naming the function that creates the discipline. Section 15.2 describes the available IO discipline creation functions. If the user does not define this macro, the system installs the IO discipline corresponding to new-line terminated ASCII records.
PADS_HDR_TY: Intuitively, this macro defines the type of the header record in the data source. This macro need only be defined if the data source has a header record. It defines a function used by the template program to generate the various function and type names derived from the name of the header record type, i.e., the type of the associated in-memory representation, mask, parse descriptor, read function, etc.
PADS_TY: Intuitively, this macro defines the type of the repeated record in the data source, i.e., the type of the value to be summarized. This macro must be defined to use the clustering template. It defines a function used by the template program to generate the various function and type names derived from the name of the record type, i.e., the type of the associated in-memory representation, mask, parse descriptor, read function, etc.
READ_MASK: This macro specifies the mask to use in reading the repeated record. If not defined by the user, the template uses the value P_CheckAndSet.