Chapter 2 Tutorial

In this chapter, we use examples to give an overview of Pads. The examples used in this chapter appear in the Pads distribution in the pads/demo directory.

2.1 Example data formats

We start by briefly describing the data formats we will use as running examples.

2.1.1 CLF: Common log format

One of the formats that web servers use to log client requests is the Common Log Format (CLF) [KR01]. Researchers use such logs to measure properties of web workloads and to evaluate protocol changes by ”replaying” the user activity recorded in the log. This ASCII format consists of a sequence of records, each of which has seven fields: the host name or IP address of the client making the request, the account associated with the request on the client side, the name the user provided for authentication, the time of the request, the actual request, the http response code, and the number of bytes returned as a result of the request. The actual request has three parts: the request method (e.g., GET, PUT), the requested uri, and the protocol version. In addition, the second and third fields are often recorded only as a ’-’ character to indicate the server did not record the actual data. Figure 2.1 shows a couple of typical records.

2.1.2 Provisioning data

In the telecommunications industry, the term provisioning refers to the steps necessary to convert an order for phone service into the actual service. To track AT&T’s provisioning process, the Sirius project compiles weekly summaries of the state of certain types of phone service orders. These ASCII summaries store the summary date and one record per order. Each order record contains a header followed by a sequence of events. The header has 13 pipe separated fields: the order number, AT&T’s internal order number, the order version, four different telephone numbers associated with the order, the zip code of the order, a billing identifier, the order type, a measure of the complexity of the order, an unused field, and the source of the order data. Many of these fields are optional, in which case nothing appears between the pipe characters. The billing identifier may not be available at the time of processing, in which case the system generates a unique identifier, and prefixes this value with the string “no_ii” to indicate the number was generated. The event sequence represents the various states a service order goes through; it is represented as a new-line terminated, pipe separated list of state, timestamp pairs. There are over 400 distinct states that an order may go through during provisioning. The sequence is sorted in order of increasing timestamps. Figure 2.2 shows a small example of this format.

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941

Figure 2.1: Tiny example of web server log data.

0|1005022800
9152|9152|1|9735551212|0||9085551212|07988|no_ii152272|EDTF_6|0|APRL1|DUO|10|1000295291
9153|9153|1|0|0|0|0||152268|LOC_6|0|FRDW1|DUO|LOC_CRTE|1001476800|LOC_OS_10|1001649601

Figure 2.2: Tiny example of Sirius provisioning data.

Punion client_t { Pip ip; /- 135.207.23.32 Phostname host; /- www.research.att.com };Punion auth_id_t { Pchar unauthorized : unauthorized == ’-’; Pstring(:’ ’:) id; };Penum method_t { GET, PUT, POST, HEAD, DELETE, LINK, UNLINK };Pstruct version_t { "HTTP/"; Puint8 major; ’.’; Puint8 minor; }; bool chkVersion(version_t v, method_t m) { if ((v.major == 1) && (v.minor == 1)) return true; if ((m == LINK) || (m == UNLINK)) return false; return true; };Pstruct request_t { ’\"’; method_t meth; ’ ’; Pstring(:’ ’:) req_uri; ’ ’; version_t version : chkVersion(version, meth); ’\"’; };Ptypedef Puint16_FW(:3:) response_t : response_t x => { 100 <= x && x < 600};Precord Pstruct entry_t { client_t client; ’ ’; auth_id_t remoteID; ’ ’; auth_id_t auth; " ["; Pdate(:’]’:) date; "] "; request_t request; ’ ’; response_t response; ’ ’; Puint32 length; };Psource Parray clt_t { entry_t []; }

Figure 2.3: Pads description for web server log data.

Precord Pstruct summary_header_t { "0|"; Puint32 tstamp; };Pstruct no_ramp_t { "no_ii"; Puint64 id; };Punion dib_ramp_t { Pint64 ramp; no_ramp_t genRamp; };Pstruct order_header_t { Puint32 order_num; ’|’; Puint32 att_order_num; ’|’; Puint32 ord_version; ’|’; Popt pn_t service_tn; ’|’; Popt pn_t billing_tn; ’|’; Popt pn_t nlp_service_tn; ’|’; Popt pn_t nlp_billing_tn; ’|’; Popt Pzip zip_code; ’|’; dib_ramp_t ramp; ’|’; Pstring(:’|’:) order_type; ’|’; Puint32 order_details; ’|’; Pstring(:’|’:) unused; ’|’; Pstring(:’|’:) stream; ’|’; };Pstruct event_t { Pstring(:’|’:) state; ’|’; Puint32 tstamp; };Parray eventSeq { event_t[] : Psep(’|’) && Pterm(Peor) ; } Pwhere { Pforall (i Pin [0..length-2] : (elts[i].tstamp <= elts[i+1].tstamp)); };Precord Pstruct entry_t { order_header_t header; eventSeq events; };Parray entries_t { entry_t[]; };Psource Pstruct out_sum{ summary_header_t h; entries_t es; };

Figure 2.4: Partial Pads description for Sirius provisioning data.

2.2 Pads descriptions

Figure 2.3 gives the Pads description for CLF web server logs, while Figure 2.4 gives the description for the Sirius provisioning data. We use these examples to illustrate various features of the Pads language. In Pads descriptions, types are declared before they are used, so the type that describes the totality of the data source appears at the bottom of the description.

Pstructs describe fixed sequences of data with unrelated types. In the CLF description, the type declaration for version_t illustrates a simple Pstruct. It starts with a string literal that matches the constant HTTP/ in the data source. It then has two unsigned integers recording the major and minor version numbers separated by the literal character ’.’. Pads supports character, string, and regular expression literals, which are interpreted with the ambient character encoding. The type request_t similarly describes the request portion of a CLF record. In addition to physical format information, this Pstruct includes a semantic constraint on the version field. Specifically, it requires that obsolete methods LINK and UNLINK occur only under HTTP/1.1. This constraint illustrates the use of predicate functions and the fact that earlier fields are in scope during the processing of later fields, as the constraint refers to both the meth and version fields in the Pstruct. Chapter 5 describes Pstructs in detail.

Punions describe variation in the data source. For example, the client_t type in the CLF description indicates that the first field in a CLF record can be either an IP address or a hostname. During parsing, the branches of a Punion are tried in order; the first branch that parses without error is taken. The auth_id_t type illustrates the use of a constraint: the branch unauthorized is chosen only if the parsed character is a dash. Pads also supports a switched union that uses a selection expression to determine the branch to parse. Typically, this expression depends upon already-parsed portions of the data source.

Pads provides Parrays to describe varying-length sequences of data all with the same type. The eventSeq_t declaration in the Sirius data description uses a Parray to characterize the sequence of events an order goes through during processing. This declaration indicates that the elements in the sequence have type event_t. It also specifies that the elements will be separated by vertical bars, and that the sequence will be terminated by an end-of-record marker (Peor). In general, Pads provides a rich collection of array-termination conditions: reaching a maximum size, finding a terminating literal (including end-of-record and end-of-source), or satisfying a user-supplied predicate over the already-parsed portion of the Parray. Finally, this type declaration includes a Pwhere clause to specify that the sequence of timestamps must be in sorted order. It uses the Pforall construct to express this constraint. In general, the body of a Pwhere clause can be any boolean expression. In such a context for arrays, the pseudo-variable elts is bound to the in-memory representation of the sequence and length to its length.

Returning to the CLF description in Figure 2.3, the Penum type method_t describes a collection of data literals. During parsing, Pads interprets these constants using the ambient character encoding. The Ptypedef response_t describes possible server response codes in CLF data by adding the constraint that the three-digit integer must be between 100 and 600.

The order_header_t type in the Sirius data description contains several anonymous uses of the Popt type. This type is syntactic sugar for a stylized use of a Punion with two branches: the first with the indicated type, and the second with the “void” type, which always matches but never consumes any input.

2.3 Generated library

From a description, the Pads compiler generates a C library for parsing and manipulating the associated data source.

typedef struct { Pbase_m compoundLevel; /* Struct-level controls, eg., check Pwhere clause */ order_header_t_m h; eventSeq_t_m events; } entry_t_m;typedef struct { Pflags_t pstate; /* Normal, Partial, or Panicking */ Puint32 nerr; /* Number of detected errors. */ PerrCode_t errCode; /* Error code of first detected error */ Ploc_t loc; /* Location of first error */ order_header_t_pd h; /* Nested header information */ eventSeq_t_pd events; /* Nested event sequence information */} entry_t_pd;typedef struct { order_header_t h; eventSeq_t events; } entry_t; /* Core parsing library */Perror_t entry_t_read (P_t *pads,entry_t_m *m,entry_t_pd *pd, entry_t *rep); /* Selected utility functions */void entry_t_m_init (P_t *pads,entry_t_m *mask,Pbase_m baseMask);int entry_t_verify (entry_t *rep); /* Selected accumulator functions */Perror_t entry_t_acc_init (P_t *pads,entry_t_acc *acc); Perror_t entry_t_acc_add (P_t *pads,entry_t_acc *acc, entry_t_pd *pd,entry_t *rep); Perror_t entry_t_acc_report (P_t *pads,char const *prefix, char const *what, int nst,entry_t_acc *acc); ssize_t entry_t_write2io (P_t *pads,Sfio_t *io,entry_t_pd *pd,entry_t *rep); /* Formatting */ssize_t entry_t_fmt2io (P_t *pads,Sfio_t *io,int *requestedOut, char const *delims,entry_t_m *m,entry_t_pd *pd, entry_t *rep); /* Conversion to XML */ssize_t entry_t_write_xml_2io (P_t *pads,Sfio_t *io,entry_t_pd *pd, entry_t *rep,char const *tag,int indent);

Figure 2.5: Selected portions of the library generated for the entry_t declaration from Sirius data description.

To give a feeling for the library that Pads generates, Figure 2.5 includes selected portions of the generated library for the Sirius entry_t declaration.

2.3.1 Example library use

Figure 2.6 shows a simple use of the generated Sirius library to filter out ill-formed records and normalize the representation of optional phone numbers. The code first initializes the Pads library handle, p, by invoking the core library function P_open. The second argument to this function allows the user to customize behavior in the Pads library by passing a (pointer to a) Pdisc_t discipline. With this discipline, the user can specify properties such as the endianness of the data, the default character set (ASCII or EBCIDC), error handling, etc. The third argument to the P_open function specifies a (pointer to a) Pio_disc_t discipline, which allows the user to describe how to detect record boundaries, i.e., are records new-line terminated, fixed width, EBCDIC-style, etc. Passing zero in these argument positions triggers default behavior, which means ASCII encoded, little endian, new-line terminated records. Chapter 15 describes the use of disciplines in more detail.

After initializing the Pads handle, the code uses the core library function P_io_fopen to open the data source by specify the path to the data on disk. The code then uses generated functions to initialize the mask, parse descriptor, and in-memory representations for the header and the entry types. The specified mask values instruct the parsing code to check all conditions in the Sirius description except the sorting of the timestamps. The P_Set flag instead instructs the compiler to simply set the in-memory representation for the timestamp sequence.

After reading the header, the code echoes error records to one file and cleaned ones to another. The raw data has two different representations of unavailable phone numbers: simply omitting the number altogether, which corresponds to the NONE branch of the Popt, or having the value 0 in the data. The function cnvPhoneNumbers unifies these two representations by converting the zeroes into NONEs. The function entry_t_verify ensures that our computation hasn’t broken any of the semantic properties of the in-memory representation of the data.

Finally, the core library functions P_io_close and P_close close the IO stream and Pads library, respectively.

P_t *p; summary_header_t header; summary_header_t_pd header_pd; summary_header_t_m header_m; entry_t entry; entry_t_pd pd; entry_t_m mask; P_open(&p, 0, 0); P_io_fopen(p, "data/sirius"); /* Initialize header data structures */summary_header_t_init(p, &header); summary_header_t_pd_init(p,&header_pd); summary_header_t_m_init(p, &header_m, P_CheckAndSet); /* Initialize entry data structures */entry_t_init(p, &entry); entry_t_pd_init(p, &pd); entry_t_m_init(p, &mask, P_CheckAndSet); mask.events.compoundLevel = P_Set; /* Try to read header */if (P_OK == summary_header_t_read(p, &header_m, &header_pd, &header)) { summary_header_t_write2io(p, CLEAN_FILE, &header_pd, &header); } else { error(2, "reading header returned: error"); } /* Try to read each line of data */while (!P_io_at_eof(p)) { entry_t_read(p, &mask, &pd, &entry); if (pd.nerr > 0) { entry_t_write2io(p, ERR_FILE, &pd, &entry); } else { cnvPhoneNumbers(&entry); if (entry_t_verify(&entry)) { entry_t_write2io(p, CLEAN_FILE, &pd, &entry); } else { error(2, "Data transform failed."); } } } P_io_close(p); P_close(p);

Figure 2.6: Fragment of a program to filter and normalize Sirius data.

2.3.2 Accumulators

Before using a data source, analysts must develop an understanding of both the layout and the meaning of the data. Because documentation is usually incomplete or out-of-date, this understanding must be developed through exploring the data itself. Typical questions include: how complete is the description of the syntax of the data source, how many different representations for “data not available" are there, what is the distribution of values for particular fields, etc. Pads addresses these kinds of questions with the notion of an accumulator. For each type in a Pads description, accumulators track the number of good values, the number of bad values, and the distribution of legal values. Selected functions from this portion of the library appear in Figure 2.5; more detailed information appears in Chapter 16.

We can of course use these functions by hand to write a program to compute the statistical profile of any Pads data source. However, ad hoc sources are often simply a sequence of records, perhaps prefixed by a header. For example, both the web server log and the Sirius data sources exhibit this pattern, as does any data format that can be read in one bulk read. As a convenience, Pads supplies a program template to construct accumulator programs for such data sources. Pads provides this template as a C include file and supports customization via C’s macro system. Using this template, we can write an accumulator program by specifying only the names of the optional header type and the record type. Figure 2.7 contains the entirety of the user-specified accumulator program text.

#define DEF_INPUT_FILE "data/wsl" /* Default data location */#define PADS_TY(suf) entry_t ## suf /* Name of record type = entry_t */#define IO_DISC_MK P_nlrec_make(0) /* Records are new line terminated */#include "wsl.h" /* Header file for generated library */#include "template/accum_report.h" /* Accumulator template program */

Figure 2.7: Accumulator program to construct statistical profiles of web server log data.

The accumulator report for the length field of the web server data that results when run on a data set used in several studies of web traffic [KW00, KW02] appears in Figure 2.8.

<top>.length : uint32
+++++++++++++++++++++++++++++++++++++++++++
good: 53544   bad: 3824    pcnt-bad: 6.666
min: 35  max: 248591  avg: 4090.234
top 10 values out of 1000 distinct values:
tracked 99.552% of values
 val:  3082 count:  1254  %-of-good:  2.342
 val:   170 count:  1148  %-of-good:  2.144
 val:    43 count:  1018  %-of-good:  1.901
 val:  9372 count:   975  %-of-good:  1.821
 val:  1425 count:   896  %-of-good:  1.673
 val:   518 count:   893  %-of-good:  1.668
 val:  1082 count:   881  %-of-good:  1.645
 val:  1367 count:   874  %-of-good:  1.632
 val:  1027 count:   859  %-of-good:  1.604
 val:  1277 count:   857  %-of-good:  1.601
. . . . . . . . . . . . . . . . . . . . . . 
 SUMMING    count:  9655  %-of-good: 18.032

Figure 2.8: Portion of accumulator report for length field of web server log data.

By default, accumulators track the first 1000 distinct values seen in the data source and report the frequency of the top ten values. In this particular run, 99.552% of all values were tracked. When generating the accumulator program (or when using the library directly), Pads users can specify the number of distinct values to track and the number of values to print in the report. Details about customizing accumulator programs appear in Chapter 16.

Perhaps surprisingly, the report shows that 6.66% of the length fields contained errors. A glance at the error log generated by the program (which contains all records flagged as errors) reveals that web servers occasionally store the ’-’ character rather than the actual number of bytes returned, a possibility not mentioned in the documentation [KR01]. Accumulators often serve as a quick tool for iteratively refining a Pads description until only genuine errors remain.

2.3.3 Formatting

To support converting ad hoc data into a delimited format, the Pads library generates a formatting function for each type. This function, an example of which appears in Figure 2.5, takes a delimiter list as an argument. At each field boundary, it prints the first delimiter. At each nested type boundary, it advances the delimiter list unless the list is exhausted, in which case it reuses the last delimiter. The mask argument allows the user to suppress printing of portions of the data. Programmers can use the library directly to write formatting programs by hand. However, as in the accumulator case, Pads can generate a formatting program for commonly occurring data patterns given only the header type (optional), record type, and a delimiter string. Users can further customize the generated program by specifying an output format for dates and mask values as illustrated in Figure 2.9.

#define DEF_INPUT_FILE "data/wsl"#include "wsl.h"#define PADS_TY(suf) entry_t ## suf #define IO_DISC_MK P_nlrec_make(0) #define DATE_OUT_FMT "%D:%T"#define DELIMS "|"#include "template/read_format.h"

Figure 2.9: PADS program for formatting CLF records using formatting template.

Given the delimiter string "|" and the output date format "%D:%T", the generated web server log formatting program yields the output shown in Figure 2.10 when applied to the sample data in Figure 2.1.

207.136.97.49|-|-|10/16/97:01:46:51|GET|/tk/p.txt|1|0|200|30
tj62.aol.com|-|-|10/16/97:21:32:22|POST|/scpt/dd@grp.org/confirm|1|0|200|941

Figure 2.10: Formatted CLF records.

To support customization, Pads allows users to provide their own formatting functions for any type. More information on formatting functions can be found in Chapter 19.

2.3.4 Conversion to Xml

Pads also supports converting ad hoc data into XML by providing a canonical mapping from Pads descriptions into XML. This mapping is quite natural, as both Pads and XML are languages for describing semi-structured data. One interesting aspect of the mapping is that we embed not just the in-memory representation of Pads values, but also the parse descriptors in cases where the data was buggy. This choice allows users to explore the error portions of their data sources, which can be the most interesting parts of the data. Given a Pads specification, the Pads compiler generates an XML Schema describing the canonical embedding for that data source. As an example, Figure 2.11 shows the portion of the Xml Schema generated from the Sirius data description related to the eventSeq type.

<xs:complexType name="eventSeq_pd">
<xs:sequence>
<xs:element name="pstate" type="Pflags_t"/>
<xs:element name="nerr" type="Puint32"/>
<xs:element name="errCode" type="PerrCode_t"/>
<xs:element name="loc" type="Ploc_t"/>
<xs:element name="neerr" type="Puint32"/>
<xs:element name="firstError" type="Puint32"/>
<xs:element name="elt" type="Puint32" 
    minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>

<xs:complexType name="eventSeq">
<xs:sequence>
<xs:element name="elt" type="event" 
    minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="length" type="Puint32"/>
<xs:element name="pd" type="eventSeq_pd" 
    minOccurs="0" maxOccurs="1"/>
</xs:sequence>
</xs:complexType>

Figure 2.11: Portion of Xml Schema generated from Sirius data description.

The Pads compiler will put the generated schema in the same directory as the generated library. The schema file will share the name of the Pads description and have the extension xsd.

The Pads compiler generates a write_xml_2io function for each type, an example of which is shown in Figure 2.5. Programmers can use these functions by hand to write conversion programs, or they can use

Pads also provides a template for generating such conversion programs for data that can be read entirely into memory.¹ Figure 2.12 contains an example of such a program. More information about converting to Xml can be found in Chapter 20.

#define DEF_INPUT_FILE "data/sirius" /* Default data source */#define PADS_TY(suf) out_sum ## suf /* Record type = out_sum */#define IO_DISC_MK P_nlrec_make(0) /* Records are new line terminated */#include "sirius.h" /* Header file for data source */#include "template/read_orig_write_xml.h" /* XML Conversion template */

Figure 2.12: Program to convert Sirius data into Xml using Pads template.

1: The Xml template program also permits the programmer to specify a header type followed by a body record type. However, the generated Xml does not include the appropriate top-level declarations to conform to the generated Xml schema. We are working on addressing this problem.