Chapter 15 Library customization

The Pads core library is parameterized by a main discipline and by an IO discipline, which allow users to customize the behavior of the core system and the IO subsystem, respectively. One aspect of controlling IO is choosing what constitutes a Precord in the data source.

15.1 The Pads discipline

A Pads handle, i.e., a value of type P_t, contains a field named my_disc of type Pdisc_t, which appears in Figure 15.1. The various fields of my_disc allow users to customize aspects of the Pads system. We describe these fields in the following sections.

typedef struct Pdisc_s {
  Pflags_t           version;         /* interface version */
  Pflags_t           flags;           /* control flags */
  Pcharset           def_charset;     /* default char set */
  int                copy_strings;    /* if set, ASCII string read functions copy strings */
  /* For the next four values, 0 means end-of-record /
     soft limit for non-record-based IO disciplines */
  size_t             match_max;       /* max match distance */
  size_t             numeric_max;     /* max numeric value distance */
  size_t             scan_max;        /* max normal scan distance */
  size_t             panic_max;       /* max panic scan distance */
  Pfopen_fn          fopen_fn;        /* file open function (default P_fopen) */
  Perror_fn          error_fn;        /* error function using ... */
  PerrorRep          e_rep;           /* controls error reporting */
  Pendian_t          d_endian;        /* endian-ness of the data */
  Puint64            acc_max2track;   /* default maximum distinct values for accumulators to track */
  Puint64            acc_max2rep;     /* default maximum number of tracked values to describe in detail in report */
  Pfloat64           acc_pcnt2rep;    /* default maximum percent of values to describe in detail in report */
  const char        *in_time_zone;    /* default time zone for time input, specified as a string */
  const char        *out_time_zone;   /* default time zone for time formatted output, specified as a string */
  Pin_formats_t      in_formats;      /* default input formats */
  Pout_formats_t     out_formats;     /* default output formats */
  Pinv_val_fn_map_t *inv_val_fn_map;  /* map types to inv_val_fn for write functions */
  Pfmt_fn_map_t     *fmt_fn_map;      /* map types to fmt functions */
  Pio_disc_t        *io_disc;         /* sub-discipline for controlling IO */
} Pdisc_t;

Figure 15.1: The Pdisc_t type, which allows users to customize the behavior of the Pads system.

15.1.1 Version

This field stores the interface version of the Pads library. The C macro P_VERSION refers to the current version.

15.1.2 Control flags

The flags field is a combination of bit fields, described in the following sections. At the moment, only one such field is in use.

White space

If the P_WSPACE_OK bit is set in the flags field, then leading white space is allowed for variable-width ASCII integers. Similarly, leading and/or trailing white space is allowed for fixed-width ASCII integers.

15.1.3 Character encodings

Pads supports ASCII and EBCDIC encodings. The def_charset field indicates the character set for interpreting base types with no explicit character encoding by storing one of the following values:

 Pcharset_ASCII
 Pcharset_EBCDIC

15.1.4 Copying strings

If the copy_strings field of the Pads discipline is non-zero, the string read functions copy the strings they find into the supplied representation. If it is zero, they do not copy; instead, the target Pstring points to memory managed by the current IO discipline. This sharing avoids copying and speeds up programs that never reference an old record after a new one is read in. The copy_strings field should be set to zero only for record-based IO disciplines where strings from record K are not used after P_io_next_rec has been called to move the IO cursor to record K+1. Note: Pstring_preserve can be used to force a string that is using sharing to make a copy so that the string is ’preserved’ (remains valid) across calls to P_io_next_rec.
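
The hazard that this field guards against can be modeled without the Pads library. The sketch below is only an illustration: the Str type, the rec_buf buffer, and the functions next_rec, str_share, str_copy, and str_free are hypothetical stand-ins for Pstring, the memory managed by the IO discipline, and the library's read functions. None of them are part of Pads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Pstring: a pointer/length pair. */
typedef struct {
    char  *ptr;
    size_t len;
    int    owned;   /* non-zero if ptr was allocated by str_copy */
} Str;

/* Stand-in for the IO discipline's record buffer, reused for every record. */
static char rec_buf[64];

/* Load the "next record" into the shared buffer, overwriting the old one. */
void next_rec(const char *data)
{
    strncpy(rec_buf, data, sizeof rec_buf - 1);
    rec_buf[sizeof rec_buf - 1] = '\0';
}

/* Sharing: point into the record buffer (models copy_strings == 0). */
Str str_share(size_t off, size_t len)
{
    Str s;
    assert(off + len <= sizeof rec_buf);
    s.ptr = rec_buf + off;
    s.len = len;
    s.owned = 0;
    return s;
}

/* Copying: take a private, NUL-terminated copy (models copy_strings != 0). */
Str str_copy(size_t off, size_t len)
{
    Str s;
    assert(off + len <= sizeof rec_buf);
    s.ptr = malloc(len + 1);
    s.len = len;
    s.owned = 1;
    if (s.ptr != NULL) {
        memcpy(s.ptr, rec_buf + off, len);
        s.ptr[len] = '\0';
    }
    return s;
}

/* Release a string produced by str_copy; shared strings own nothing. */
void str_free(Str *s)
{
    if (s->owned) {
        free(s->ptr);
    }
    s->ptr = NULL;
    s->len = 0;
}
```

In this model, a string shared from the first five bytes of record "alice,42" reads "bobby" once next_rec("bobby,17") has run, while a copy taken before that call still reads "alice". That stale view is the situation in which a real program would need Pstring_preserve.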

15.1.5 Scanning extent

When scanning for a literal or regular expression match, how far should the scan/match go before giving up? If a record-based IO discipline is used, scanning and matching are limited to the scope of a single record. In addition, the following four Pdisc_t fields can be used to provide further constraints on scan/match scope.

match_max: This field specifies the maximum number of bytes that will be included in an inclusive pattern match attempt (see, e.g., data type Pstring_ME). If set to zero, no match_max constraint is imposed for a record-based IO discipline (other than finding end-of-record), whereas a built-in soft limit of P_BUILTIN_MATCH_MAX characters is imposed for non-record-based IO disciplines. (The built-in limit is soft because if the match happens to get more than P_BUILTIN_MATCH_MAX characters in a single IO discipline read call it will go ahead and consider all of them. In contrast, if the discipline match_max is set explicitly to value K, then this is a hard limit: the match will only consider K characters even if more are available.)
numeric_max: This field specifies the maximum number of bytes that will be included in an attempt to read a character-based representation of a number. If non-zero, the field should be set large enough to cover any leading white space (if allowed by P_WSPACE_OK), an optional +/- sign, and the digits (dot etc. for floats) that make up the numeric value. A value of zero for numeric_max results in an end-of-record constraint for record-based IO disciplines and in a soft limit of P_BUILTIN_NUMERIC_MAX bytes for non-record-based IO disciplines.
scan_max: This field specifies the maximum number of bytes that will be considered by a normal scan that is looking for a terminating literal or a terminating regular expression (see, e.g., data type Pstring_SE). Note that this includes both the bytes skipped and the bytes used for the match. A scan_max of zero results in an end-of-record constraint for record-based IO disciplines and in a soft limit of P_BUILTIN_SCAN_MAX bytes for non-record-based IO disciplines.
panic_max: This field specifies the maximum number of bytes that will be considered when parsing hits a ’panic’ state and is looking for a synchronizing literal or pattern. See, for example, termination conditions for user-defined array types. A panic_max of zero results in an end-of-record constraint for record-based IO disciplines and in a soft limit of P_BUILTIN_PANIC_MAX bytes for non-record-based IO disciplines.

For non-record-based IO disciplines, the default soft limits may be either too small or too large for a given input type. It is important for these cases to determine appropriate hard limit settings.

The built-in soft limits for use with non-record-based IO disciplines are as follows. Although you can change them and recompile the PADS library, it is easier to simply set up the correct hard limits in the discipline.

#define P_BUILTIN_MATCH_MAX   512
#define P_BUILTIN_SCAN_MAX    512
#define P_BUILTIN_NUMERIC_MAX 512
#define P_BUILTIN_PANIC_MAX  1024
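
The difference between hard limits, soft limits, and the end-of-record constraint can be written down as a small model. The sketch below is one reading of the rules in this section, not the library's implementation. The names BUILTIN_SCAN_MAX, bytes_wanted, and bytes_examined are hypothetical, and the value 512 simply mirrors P_BUILTIN_SCAN_MAX above.

```c
#include <stddef.h>

#define BUILTIN_SCAN_MAX 512   /* mirrors P_BUILTIN_SCAN_MAX; an assumption of this model */

/* How many bytes a scan asks the IO discipline to make available.
 * A return value of 0 stands for "up to end-of-record". */
size_t bytes_wanted(size_t disc_max, int record_based)
{
    if (disc_max != 0) {
        return disc_max;                         /* explicit discipline setting */
    }
    return record_based ? 0 : BUILTIN_SCAN_MAX;  /* default behavior */
}

/* How many of the bytes in scope the scan may examine.
 * in_scope is the number of bytes up to end-of-record for a record-based
 * discipline, or the number of bytes buffered so far otherwise. */
size_t bytes_examined(size_t disc_max, size_t in_scope)
{
    if (disc_max != 0 && disc_max < in_scope) {
        return disc_max;   /* hard limit: never look past disc_max bytes */
    }
    return in_scope;       /* soft or end-of-record limit: use all bytes in scope */
}
```

Under this model, an unset limit on a non-record-based discipline requests 512 bytes but examines all 4096 if a single read happens to deliver that many, whereas an explicit limit of 512 examines only 512 of them.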

15.1.6 File open function

The field fopen_fn stores the file open function used by P_io_fopen. If this field is NULL, the default function P_fopen is used.

A Pfopen_fn takes arguments (source, mode) and opens file source in the specified mode and returns an SFIO stream on success or NULL on failure.

typedef Sfio_t* (*Pfopen_fn)(const char *source, const char *mode);

It should normally have the same semantics as the call sfopen(NiL, source, mode), except that for the following string constants it should return an existing SFIO stream:

`"/dev/stdin"`	—→	`sfstdin`
`"/dev/stdout"`	—→	`sfstdout`
`"/dev/stderr"`	—→	`sfstderr`

For "/dev/stdin", mode "r" (read) is expected. For "/dev/stdout" or "/dev/stderr", mode "a" (append-only) is expected. If you use some other mode for these three cases, the function should attempt to apply mode to the specified SFIO stream; it should return NULL if this fails, otherwise the specified stream.
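
The contract above can be sketched with the C standard library standing in for SFIO, so that the example compiles without the SFIO headers. This is an analogy only: FILE* replaces Sfio_t*, fopen replaces sfopen, and freopen(NULL, mode, stream) is used as the closest stdio equivalent of applying a new mode to an existing stream. The names std_stream and my_fopen are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return an existing standard stream, switching its mode only if the
 * caller asked for something other than the expected mode. freopen
 * returns NULL if the mode change fails, otherwise the stream. */
FILE *std_stream(FILE *stream, const char *expected, const char *mode)
{
    if (strcmp(mode, expected) == 0) {
        return stream;
    }
    return freopen(NULL, mode, stream);
}

/* stdio analogue of a Pfopen_fn: (source, mode) -> stream or NULL. */
FILE *my_fopen(const char *source, const char *mode)
{
    if (source == NULL || mode == NULL) {
        return NULL;
    }
    if (strcmp(source, "/dev/stdin") == 0) {
        return std_stream(stdin, "r", mode);
    }
    if (strcmp(source, "/dev/stdout") == 0) {
        return std_stream(stdout, "a", mode);
    }
    if (strcmp(source, "/dev/stderr") == 0) {
        return std_stream(stderr, "a", mode);
    }
    return fopen(source, mode);   /* ordinary files: NULL on failure */
}
```

A real replacement would have the Pfopen_fn signature shown above, return SFIO streams, and be installed in the fopen_fn field of the discipline.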

15.1.7 Error function

The field error_fn stores the error reporting function. If this field is NULL, the default function P_error is used. The type for this function is:

typedef int (*Perror_fn)(const char *libnm, int level, ...);

Error functions are a lot like the standard printf function, with two additional arguments, libnm and level. For example, where one might use printf as follows:

printf("count: %d\n", 10);

one can do the same thing with P_error using:

P_error(NULL, P_LEV_INFO, "count: %d", 10);

Note that P_error automatically adds a newline, so we did not have to include a "\n" in the format string, as we did with the printf example.

The first argument to an error function, libnm, is normally NULL (it is for the name of the library calling the error function).

The second argument, level, is used to specify the severity of the error. Level P_LEV_INFO is used for an informative (non-error) message, P_LEV_WARN is used for a warning, P_LEV_ERR for a non-fatal error, and P_LEV_FATAL for a fatal error. One can ’or’ in the following flags (as in P_LEV_WARN|P_FLG_PROMPT):

`P_FLG_PROMPT`		do not emit a newline
`P_FLG_SYSERR`		add a description of `errno` (`errno` should be a system error)
`P_FLG_LIBRARY`		error is from library

Library messages are forced if the environment variable ERROR_OPTIONS includes the string “library”; SYSERR (errno) messages are forced if it includes the string “system”.

For convenience, if the first argument, library name libnm, is non-NULL, then flag P_FLG_LIBRARY is automatically or’d into level.
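
A replacement error function has to pull the format string out of the variable argument list itself, because the Perror_fn type declares only libnm and level as named parameters. The sketch below shows one way to do that in standard C. The MY_LEV_* and MY_FLG_PROMPT constants are hypothetical stand-ins with made-up values (the real P_LEV_* and P_FLG_* values come from the Pads headers), and the meaning of the return value is also an assumption.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the Pads level and flag constants. */
enum {
    MY_LEV_INFO   = 0,
    MY_LEV_WARN   = 1,
    MY_LEV_ERR    = 2,
    MY_LEV_FATAL  = 3,
    MY_LEV_MASK   = 0xff,    /* low bits hold the severity */
    MY_FLG_PROMPT = 0x100    /* do not emit a newline */
};

static char last_msg[256];   /* most recent message, kept for inspection */

/* Same shape as Perror_fn: int fn(const char *libnm, int level, ...). */
int my_error(const char *libnm, int level, ...)
{
    static const char *const names[] = { "info", "warning", "error", "fatal" };
    int sev = level & MY_LEV_MASK;
    const char *fmt;
    size_t used;
    va_list ap;

    if (sev > MY_LEV_FATAL) {
        sev = MY_LEV_ERR;    /* treat unknown severities as errors */
    }

    /* Prefix: optional library name, then the severity label. */
    snprintf(last_msg, sizeof last_msg, "%s%s%s: ",
             libnm ? libnm : "", libnm ? ": " : "", names[sev]);
    used = strlen(last_msg);

    /* The first variable argument is the printf-style format string. */
    va_start(ap, level);
    fmt = va_arg(ap, const char *);
    assert(fmt != NULL);
    vsnprintf(last_msg + used, sizeof last_msg - used, fmt, ap);
    va_end(ap);

    /* Append a newline unless the prompt flag is set. */
    if (!(level & MY_FLG_PROMPT)) {
        used = strlen(last_msg);
        if (used + 1 < sizeof last_msg) {
            last_msg[used] = '\n';
            last_msg[used + 1] = '\0';
        }
    }

    fputs(last_msg, stderr);
    return 0;                /* assumed convention for this sketch */
}
```

With these stand-ins, my_error(NULL, MY_LEV_INFO, "count: %d", 10) writes "info: count: 10" followed by a newline, mirroring the P_error call shown earlier.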

15.1.8 Endian-ness

The field d_endian stores the endianness of the data. It can have the value PbigEndian or PlittleEndian. If d_endian does not equal the endianness of the machine running the parsing code, the byte order of binary integers is swapped by the binary integer read function.
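
The swap rule can be illustrated in portable C. This is a sketch of the general technique, not the Pads read function: data_endian_t, host_endian, swap32, and read_u32 are hypothetical names, with data_endian_t standing in for Pendian_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for Pendian_t. */
typedef enum { DATA_BIG_ENDIAN, DATA_LITTLE_ENDIAN } data_endian_t;

/* Detect the byte order of the machine running the code. */
data_endian_t host_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* lowest-addressed byte of the probe */
    return first == 1 ? DATA_LITTLE_ENDIAN : DATA_BIG_ENDIAN;
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Read a 32-bit binary integer whose bytes are stored in data_endian
 * order, swapping only when that differs from the host order. */
uint32_t read_u32(const unsigned char *bytes, data_endian_t data_endian)
{
    uint32_t v;
    assert(bytes != NULL);
    memcpy(&v, bytes, sizeof v);
    return data_endian == host_endian() ? v : swap32(v);
}
```

Either byte layout of the same number then decodes to the same value on any host, as long as the declared data endianness matches the layout.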

15.1.9 Accumulator customization

The fields acc_max2track, acc_max2rep, and acc_pcnt2rep allow the user to customize accumulator behavior. Section 16.2 describes these fields in detail.

15.1.10 Input time zone

The field in_time_zone specifies the default time zone for string-based time input, used for input date strings that do not have time zone information in them. For example, the date 15/Oct/1997:18:46:51 has no time zone information. If in_time_zone is set to "UTC", then this date/time would be assumed to be a UTC time. In contrast, regardless of the in_time_zone setting, the date 15/Oct/1997:18:46:51 -0700 will always be interpreted as being in a time zone seven hours earlier than UTC time.

The in_time_zone is passed to the tmzone function, so it must be a time zone description that tmzone understands. Intuitively, it accepts three-letter strings such as "PST" and "EDT" as well as strings denoting numeric offsets from UTC time, such as "-0500". Chapter 14 describes the legal time zone designation strings. Documentation for the tmzone function appears on the web page: www.research.att.com/gsf/man/man3/tm.html

Before calling P_open, the discipline field disc->in_time_zone can be initialized directly. After calling P_open, however, it must be changed by passing the Pads handle and a time zone string to P_set_in_time_zone, e.g.,

P_set_in_time_zone(pads,"PST");

This will set pads->disc->in_time_zone, and will also update an internal representation of the time zone maintained as part of the pads state.

15.1.11 Output time zone

The field out_time_zone specifies the time zone for formatted time output. Regardless of the time zone used to read in a time, disc->out_time_zone controls which time zone is used when formatting the time for output. For example, a time that is read as 6am UTC time would be formatted as 1am if out_time_zone is "-0500". Note that in the normal case you should use the same time zone for both input and output, unless you are intentionally translating times from one time zone to another. The format of output time zone specification strings is the same as for the input time zone. Chapter 14 describes the legal time zone designation strings. As with the input time zone, the value can be changed on an open handle with a call such as:

P_set_output_time_zone(pads,"CDT");

This will set pads->disc->out_time_zone, and will also update an internal representation of the output time zone maintained as part of the pads state.
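
The 6am-to-1am example is plain offset arithmetic, which the following sketch works through for numeric zone strings of the form "+HHMM" or "-HHMM". It does not use tmzone and does not handle named zones such as "PST"; parse_utc_offset and utc_to_local_minute are hypothetical helper names.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Parse "+HHMM" or "-HHMM" into minutes east of UTC.
 * Returns 0 on success and -1 if the string has another shape. */
int parse_utc_offset(const char *tz, int *minutes)
{
    int sign, hh, mm, i;

    if (tz == NULL || minutes == NULL || strlen(tz) != 5) {
        return -1;
    }
    if (tz[0] == '+') {
        sign = 1;
    } else if (tz[0] == '-') {
        sign = -1;
    } else {
        return -1;
    }
    for (i = 1; i < 5; i++) {
        if (!isdigit((unsigned char)tz[i])) {
            return -1;
        }
    }
    hh = (tz[1] - '0') * 10 + (tz[2] - '0');
    mm = (tz[3] - '0') * 10 + (tz[4] - '0');
    if (mm > 59) {
        return -1;
    }
    *minutes = sign * (hh * 60 + mm);
    return 0;
}

/* Shift a UTC minute-of-day (0..1439) by an offset, wrapping at midnight. */
int utc_to_local_minute(int utc_minute_of_day, int offset_minutes)
{
    int m;
    assert(utc_minute_of_day >= 0 && utc_minute_of_day < 1440);
    m = (utc_minute_of_day + offset_minutes) % 1440;
    return m < 0 ? m + 1440 : m;
}
```

For "-0500" the offset is -300 minutes, so minute 360 (6am UTC) maps to minute 60 (1am), matching the example in this section.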

15.1.12 Input formats

The in_formats field of the discipline allows one to specify default input formats for some special types where there is no ’obvious’ default. Figure 15.2 contains the type of this field.

typedef struct Pin_formats_s { const char *timestamp; const char *date; const char *time; } Pin_formats_t;

Figure 15.2: The Pin_formats_t type, which allows users to specify the input format of various Pads base types. Each of the fields must be a non-null string with a format understood by the tmdate function.

The current entries are:

in_formats.timestamp

This field contains a format string specifying the input timestamp format, for use with Ptimestamp and its variants. Alternatives can be given using %|, and the special %& format can be used to indicate all formats that can be parsed by the tmdate function. The default,

"%m%d%y+%H%M%S%|%m%d%y+%H:%M:%S%|%m%d%Y+%H%M%S%|%m%d%Y+%H:%M:%S%|%&"

allows for timestamps of these forms:

 091172+230202
 091172+23:02:02
 09111972+230202
 09111972+23:02:02

and also allows for all date/time formats parsed by tmdate. Documentation for the tmdate function appears on the web page: www.research.att.com/gsf/man/man3/tm.html

in_formats.date

A format string specifying the input date format, for use with Pdate and its variants. The default,

"%m%d%y%|%m%d%Y%|%&"

allows for dates of these two forms:

   091172
   09111972

and also allows for all date formats parsed by tmdate. Documentation for the tmdate function appears on the web page: www.research.att.com/gsf/man/man3/tm.html

in_formats.time

A format string specifying the input time format, for use with Ptime and its variants. The default,

"%H%M%S%|%H:%M:%S%|%&"

allows for times of these two forms:

  230202
  23:02:02

and also allows for all date formats parsed by tmdate. Documentation for the tmdate function appears on the web page: www.research.att.com/gsf/man/man3/tm.html
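
As an illustration of overriding these defaults, the sketch below fills in a Pin_formats_t with ISO-style layouts, each followed by the %& catch-all as a fallback. The typedef is copied from Figure 15.2 so that the example compiles without the Pads headers; in a real program the same assignments would be made to the in_formats field of the discipline. The helper names are hypothetical, and whether tmdate accepts each directive on input should be checked against its documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Copied from Figure 15.2 so the sketch compiles without the Pads headers. */
typedef struct Pin_formats_s {
    const char *timestamp;
    const char *date;
    const char *time;
} Pin_formats_t;

/* Prefer ISO-style layouts, then fall back to anything tmdate can parse.
 * %| separates alternatives and %& means "all formats tmdate understands". */
void use_iso_input_formats(Pin_formats_t *f)
{
    assert(f != NULL);
    f->timestamp = "%Y-%m-%d+%H:%M:%S%|%&";
    f->date      = "%Y-%m-%d%|%&";
    f->time      = "%H:%M:%S%|%&";
}

/* Figure 15.2 requires every field to be a non-null string. */
int formats_are_set(const Pin_formats_t *f)
{
    return f != NULL && f->timestamp != NULL
        && f->date != NULL && f->time != NULL;
}
```

Listing the preferred layout first and ending with %& follows the same ordering as the built-in defaults.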

15.1.13 Output formats

The out_formats field of the discipline allows one to specify default output formats for some special types where there is no ’obvious’ default. Figure 15.3 contains the type of this field.

typedef struct Pout_formats_s { const char *timestamp_explicit; const char *timestamp; const char *date_explicit; const char *date; const char *time_explicit; const char *time; } Pout_formats_t;

Figure 15.3: The Pout_formats_t type, which allows users to specify the output format of various Pads base types. Each of the fields must be a non-null string with a format understood by the fmttime function.

The current entries are:

out_formats.timestamp
out_formats.timestamp_explicit: These two values specify the default output formats for the Ptimestamp and Ptimestamp_explicit families of types, respectively. The normal use is for these formats to describe both the date and time of day. Some examples:

  "%Y%m%d|%H%M%S"
  "%m/%d/%Y %H:%M"
  "%K"   /* default; %K is the same as "%Y-%m-%d+%H:%M:%S" */

Documentation for the tmdate function appears on the web page: www.research.att.com/gsf/man/man3/tm.html

out_formats.date
out_formats.date_explicit: These two values specify the default output formats for the Pdate and Pdate_explicit families of types, respectively. The normal use is for these formats to describe the date but not the time of day. Some examples:

  "%Y%m%d"
  "%m/%d/%Y"
  "%Y-%m-%d"   /* default */

Documentation for the tmdate function appears on the web page: www.research.att.com/gsf/man/man3/tm.html

out_formats.time
out_formats.time_explicit: These two values specify the default output formats for the Ptime and Ptime_explicit families of types, respectively. The normal use is for these formats to describe a time of day but not the date. Some examples:

  "%H%M%S"
  "%H.%M"
  "%H:%M:%S"   /* default */

Documentation for the tmdate function appears on the web page: www.research.att.com/gsf/man/man3/tm.html

15.1.14 Writing invalid values

Write functions take a parse descriptor and a value. The value is valid if the parse descriptor’s errCode is set to P_NO_ERR. The value has been filled in if the errCode is P_USER_CONSTRAINT_VIOLATION. For other errCodes, the value should be assumed to contain garbage. For invalid values, the write function must still write SOME value. For every type, one can specify an inv_val helper function that produces an invalid value for the type, to be used by the type’s write functions. If no function is specified, then a default invalid value is used, where there are two cases: if the errCode is P_USER_CONSTRAINT_VIOLATION, then the current invalid value is used; for any other errCode, a default invalid value is used.

The map from write functions to inv_val functions is found in the discipline in the field inv_val_fn_map, which maps values of type const char * (the string form of the type name) to Pinv_val_fn functions. If the inv_val_fn_map field is NULL, no mappings are used.

An invalid value function that handles type T values always takes four arguments:

The P_t* handle
A pointer to a type T parse descriptor
A pointer to the invalid type T representation
A va_list argument that encodes the extra type parameters (if any) associated with the type. For example, type Pa_int32_FW(:<width>:) has a single type parameter (width) of type Puint32, so va_list would encode a single Puint32 argument.

Arguments two and three use void* types to enable the table to be used with arbitrary types, including user-defined types. One must cast these void* args to the appropriate pointer types before use; see the example below. The function should replace the invalid value with a new value, and return P_OK on success and P_ERR if a replacement value has not been set.

Use P_set_inv_val_fn to set a function pointer, P_get_inv_val_fn to do a lookup:

Pinv_val_fn P_get_inv_val_fn(P_t* pads, Pinv_val_fn_map_t *map, const char *type_name);
Pinv_val_fn P_set_inv_val_fn(P_t* pads, Pinv_val_fn_map_t *map, const char *type_name, Pinv_val_fn fn);

Example: suppose a Pa_int32 field has an attached constraint that requires the value to be at least negative thirty. If a value of negative fifty is read, errCode will be P_USER_CONSTRAINT_VIOLATION. If no inv_val function has been provided, then the Pa_int32 write function will output -50. If the read function fails to read a valid integer at all, the errCode will be P_INVALID_A_NUM, and the Pa_int32 write function will output P_MIN_INT32 (the default invalid value for all int32 write functions). If one wanted to correct all user constraint cases to use value -30, and to use P_MAX_INT32 for other invalid cases, one could provide an inv_val helper function to do so:

Perror_t my_int32_inv_val(P_t *pads, void *pd_void, void *val_void, va_list type_args)
{
  /* cast the void* arguments to the types used for Pint32 values */
  Pbase_pd *pd  = (Pbase_pd*)pd_void;
  Pint32   *val = (Pint32*)val_void;
  if (pd->errCode == P_USER_CONSTRAINT_VIOLATION) {
    (*val) = -30;          /* clamp constraint violations to the bound */
  } else {
    (*val) = P_MAX_INT32;  /* marker for all other invalid cases */
  }
  return P_OK;
}

/* create call only needed if no map installed yet */
pads->disc->inv_val_fn_map = Pinv_val_fn_map_create(pads);
P_set_inv_val_fn(pads, pads->disc->inv_val_fn_map, "Pint32", my_int32_inv_val);

Note that for a type T with three forms, P_T, Pa_T, and Pe_T, there is only one entry in the inv_val_fn_map, under string "P_T". For example, use "Pint32" to specify an invalid value function for all of these types: Pint32, Pa_int32, and Pe_int32.

An inv_val_fn for a string type should use Pstring_copy, Pstring_cstr_copy, Pstring_share, or Pstring_cstr_share to fill in the value of the Pstring* param.

15.2 The IO discipline

IO discipline values, which have type Pio_disc_t, control the ’raw’ reading of data from a file or from some other data source. The Pads system provides a collection of functions for generating various IO disciplines, corresponding to various kinds of record structures: new-line terminated, fixed width, IBM-style (initial data indicating size of record, followed by payload), etc. In addition, the discipline indicates if the data source is seekable (a file) or not (a stream).

To use an IO discipline, the user first creates one by invoking a creation function supplied by the Pads system. The resulting IO discipline is then installed by passing it as an argument to either P_open or to P_set_io_disc.

Pio_disc_t * P_fwrec_make(size_t leader_len, size_t data_len, size_t trailer_len): Instantiates an instance of fwrec, a discipline for fixed-width records. The parameter data_len specifies the number of data bytes per record, while leader_len and trailer_len specify the number of bytes that occur before and after the data bytes within each record (either or both can be zero). Thus the total record size in bytes is the sum of the three arguments.
Pio_disc_t * P_fwrec_noseek_make(size_t leader_len, size_t data_len, size_t trailer_len): Instantiates an instance of fwrec_noseek, a version of fwrec that does not require that the SFIO stream is seekable.
Pio_disc_t * P_ctrec_make(Pbyte termChar, size_t block_size_hint): Instantiates an instance of ctrec, a discipline for character-terminated variable-width records. Argument termChar is the character that marks the end of a record. Argument block_size_hint suggests a block size to use, if the discipline chooses to do fixed block-sized reads ’under the covers’. It may be ignored by the discipline. For ASCII newline-terminated records, use ’\n’ or P_ASCII_NEWLINE as the term character. For EBCDIC newline-terminated records, use P_EBCDIC_NEWLINE as the term character.
Pio_disc_t * P_ctrec_noseek_make(Pbyte termChar, size_t block_size_hint): Instantiates an instance of ctrec_noseek, a version of ctrec that does not require that the SFIO stream is seekable.
Pio_disc_t * P_nlrec_make(size_t block_size_hint): Shorthand for calling the corresponding ctrec make function with ’\n’ as termChar.
Pio_disc_t * P_nlrec_noseek_make(size_t block_size_hint): Shorthand for calling the corresponding ctrec make function with ’\n’ as termChar.
Pio_disc_t * P_vlrec_make(int blocked, size_t avg_rlen_hint): Instantiates an instance of vlrec, a discipline for IBM-style variable-length records with record length specified at the start of each record. If blocked is set (!= 0) then the records are grouped into blocks, where each block has a length given at the start of each block. Argument avg_rlen_hint is a hint as to what the average record length is, to help the discipline allocate memory. It should include the four bytes at the start of each record used for the record length. It may be ignored by the discipline.
Pio_disc_t * P_vlrec_noseek_make(int blocked, size_t avg_rlen_hint): Instantiates an instance of vlrec_noseek, a version of vlrec that does not require that the SFIO stream is seekable.
Pio_disc_t * P_norec_make(size_t block_size_hint): Instantiates an instance of norec, a raw bytes discipline that does not use records. Argument block_size_hint is a hint as to what block size to use, if the discipline chooses to do fixed block-sized reads ’under the covers’. It may be ignored by the discipline.
Pio_disc_t * P_norec_noseek_make(size_t block_size_hint): Instantiates an instance of norec_noseek, a version of norec that does not require that the SFIO stream is seekable.
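
To make the fixed-width layout concrete, the sketch below computes the record size from the three fwrec parameters and locates the data bytes of record k inside an in-memory buffer. It models only the layout described for P_fwrec_make and is not the discipline's implementation; fwrec_size and fwrec_data are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes per record: leader, then data, then trailer. */
size_t fwrec_size(size_t leader_len, size_t data_len, size_t trailer_len)
{
    return leader_len + data_len + trailer_len;
}

/* Return a pointer to the data bytes of record k (0-based) in buf,
 * or NULL if record k is not completely contained in the buffer. */
const unsigned char *fwrec_data(const unsigned char *buf, size_t buf_len,
                                size_t k, size_t leader_len,
                                size_t data_len, size_t trailer_len)
{
    size_t rec = fwrec_size(leader_len, data_len, trailer_len);

    assert(buf != NULL);
    if (rec == 0 || k >= buf_len / rec) {
        return NULL;                     /* no such complete record */
    }
    return buf + k * rec + leader_len;   /* skip whole records, then the leader */
}
```

For the 12-byte input "[abc]\n[def]\n" with leader_len 1, data_len 3, and trailer_len 2, each record is 6 bytes long and the data of record 1 starts at the byte 'd'.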

15.2.1 Closing an IO discipline

When an IO discipline is no longer needed, the user should unmake it. The function P_io_disc_unmake explicitly deallocates an IO discipline. In addition, the function P_close deallocates the installed IO discipline. The function P_set_io_disc deallocates the previously installed discipline. If desired, an IO discipline can be preserved using P_close_keep_io_disc or P_set_io_disc_keep_old, in which case it can be re-used in a future P_open or P_set_io_disc call.

15.2.2 Implementations

Implementations of the standard IO disciplines can be found in libpads/default_io_disc.c. Anyone planning to implement a new IO discipline should consult default_io_disc.c.