The Pads core library is parameterized by a main discipline and by an IO discipline that allow users to customize the behavior of the core system and the IO subsystem, respectively. One aspect of controlling IO is choosing what constitutes a Precord in the data source.
A Pads handle, i.e., a value of type P_t, contains a field named my_disc of type Pdisc_t, which appears in Figure 15.1. The various fields of my_disc allow users to customize aspects of the Pads system. We describe these fields in the following sections.
typedef struct Pdisc_s {
Pflags_t version; /* interface version */
Pflags_t flags; /* control flags */
Pcharset def_charset; /* default char set */
int copy_strings; /* if set, ASCII string read
functions copy strings */
/* For the next four values, 0 means end-of-record /
soft limit for non-record-based IO disciplines */
size_t match_max; /* max match distance */
size_t numeric_max; /* max numeric value distance */
size_t scan_max; /* max normal scan distance */
size_t panic_max; /* max panic scan distance */
Pfopen_fn fopen_fn; /* file open function (default P_fopen) */
Perror_fn error_fn; /* error function using ... */
PerrorRep e_rep; /* controls error reporting */
Pendian_t d_endian; /* endian-ness of the data */
Puint64 acc_max2track; /* default maximum distinct values for
accumulators to track */
Puint64 acc_max2rep; /* default maximum number of tracked values
to describe in detail in report */
Pfloat64 acc_pcnt2rep; /* default maximum percent of values to
describe in detail in report */
const char *in_time_zone; /* default time zone for time input,
specified as a string */
const char *out_time_zone; /* default time zone for time formatted
output, specified as a string */
Pin_formats_t in_formats; /* default input formats */
Pout_formats_t out_formats; /* default output formats */
Pinv_val_fn_map_t *inv_val_fn_map; /* map types to inv_val_fn
for write functions */
Pfmt_fn_map_t *fmt_fn_map; /* map types to fmt functions */
Pio_disc_t *io_disc; /* sub-discipline for controlling IO */
} Pdisc_t;
The version field stores the interface version of the Pads library. The C macro P_VERSION refers to the current version.
The flags field is a combination of bit fields, described in the following sections. At the moment, only one such field is in use.
If the P_WSPACE_OK bit is set in the flags field, then leading white space is allowed for variable-width ASCII integers. Similarly, leading and/or trailing white space is allowed for fixed-width ASCII integers.
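The effect of such a flag bit on a variable-width integer read can be sketched as follows. This is only an illustration, not the Pads implementation: DEMO_WSPACE_OK and demo_read_int are hypothetical stand-ins, and the real value of P_WSPACE_OK comes from the Pads headers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical stand-in for the P_WSPACE_OK bit. */
#define DEMO_WSPACE_OK 0x1u

/* Parse a variable-width ASCII integer at the start of s.
 * Returns 0 and stores the value on success, -1 on failure. */
static int demo_read_int(const char *s, unsigned flags, long *out)
{
    char *end;

    assert(s != NULL && out != NULL);
    if (isspace((unsigned char)*s)) {
        if (!(flags & DEMO_WSPACE_OK))
            return -1;                 /* leading white space not allowed */
        while (isspace((unsigned char)*s))
            s++;                       /* flag set: skip the white space */
    }
    *out = strtol(s, &end, 10);
    return (end == s) ? -1 : 0;        /* fail if no digits were consumed */
}
```

With the flag clear, the input "  42" is rejected; with the flag set, it yields 42.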
Pads supports ASCII and EBCDIC encodings. The def_charset field indicates the character set for interpreting base types with no explicit character encoding by storing one of the following values:
Pcharset_ASCII Pcharset_EBCDIC
If the copy_strings field of the Pads discipline is non-zero, the string read functions copy the strings they find into the supplied representation. Otherwise they do not copy; instead, the target Pstring points to memory managed by the current IO discipline. This sharing avoids copying and speeds up programs that never reference an old record after a new one is read in. The copy_strings field should be set to zero only for record-based IO disciplines where strings from record K are not used after P_io_next_rec has been called to move the IO cursor to record K+1. Note: Pstring_preserve can be used to force a string that is using sharing to make a copy so that the string is 'preserved' (remains valid) across calls to P_io_next_rec.
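The difference between sharing and copying can be seen in a small self-contained model. Everything here is a hypothetical stand-in: demo_string plays the role of a Pstring, demo_rec_buf the record buffer owned by an IO discipline, and demo_next_rec the effect of P_io_next_rec.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a Pstring: a pointer plus a length. */
typedef struct { const char *str; size_t len; } demo_string;

/* Stand-in for the record buffer owned by an IO discipline. */
static char demo_rec_buf[32];

/* Load the next "record" into the buffer, overwriting the old one. */
static void demo_next_rec(const char *data)
{
    assert(data != NULL && strlen(data) < sizeof demo_rec_buf);
    strcpy(demo_rec_buf, data);
}

/* Read the first len bytes of the current record as a string.
 * With copy != 0 the bytes are duplicated (caller frees them);
 * otherwise the result points into the record buffer. */
static demo_string demo_read_string(size_t len, int copy)
{
    demo_string s;

    assert(len < sizeof demo_rec_buf);
    s.len = len;
    if (copy) {
        char *p = malloc(len + 1);
        assert(p != NULL);             /* sketch only: no error recovery */
        memcpy(p, demo_rec_buf, len);
        p[len] = '\0';
        s.str = p;
    } else {
        s.str = demo_rec_buf;          /* shared: valid until the next record */
    }
    return s;
}
```

After a second call to demo_next_rec, a copied string still holds the old bytes, while a shared string now shows the bytes of the new record.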
When scanning for a literal or regular expression match, how far should the scan/match go before giving up? If a record-based IO discipline is used, scanning and matching are limited to the scope of a single record. In addition, the four Pdisc_t fields match_max, numeric_max, scan_max, and panic_max can be used to provide further constraints on scan/match scope.
For non-record-based IO disciplines, the default soft limits may be either too small or too large for a given input type. In these cases it is important to determine appropriate hard limit settings.
The built-in soft limits for use with non-record-based IO disciplines are as follows. Although you can change them and recompile the PADS library, it is easier to simply set up the correct hard limits in the discipline.
#define P_BUILTIN_MATCH_MAX 512
#define P_BUILTIN_SCAN_MAX 512
#define P_BUILTIN_NUMERIC_MAX 512
#define P_BUILTIN_PANIC_MAX 1024
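One way to read the interaction between the discipline fields, record boundaries, and the built-in soft limits is sketched below for scan_max. The function demo_scan_limit is hypothetical and encodes only the rules stated in this section (a value of 0 means end-of-record for record-based disciplines and the built-in soft limit otherwise); the library's actual logic may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

#define P_BUILTIN_SCAN_MAX 512   /* built-in soft limit quoted above */

/* Decide how many bytes a scan may cover.
 *   scan_max      discipline setting; 0 means no hard limit was set
 *   record_based  non-zero if the IO discipline delivers records
 *   bytes_to_eor  bytes left in the current record (ignored otherwise) */
static size_t demo_scan_limit(size_t scan_max, int record_based,
                              size_t bytes_to_eor)
{
    if (record_based) {
        /* Never scan past the record; a hard limit can only tighten this. */
        if (scan_max == 0 || scan_max > bytes_to_eor)
            return bytes_to_eor;
        return scan_max;
    }
    /* No records: fall back to the soft limit when no hard limit is set. */
    return scan_max != 0 ? scan_max : P_BUILTIN_SCAN_MAX;
}

/* A few worked cases. */
static void demo_scan_limit_examples(void)
{
    assert(demo_scan_limit(0, 1, 80) == 80);      /* end of record */
    assert(demo_scan_limit(16, 1, 80) == 16);     /* hard limit inside record */
    assert(demo_scan_limit(0, 0, 0) == 512);      /* soft limit */
    assert(demo_scan_limit(4096, 0, 0) == 4096);  /* hard limit, no records */
}
```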
The field fopen_fn stores the file open function used by P_io_fopen. If this field is NULL, the default function P_fopen is used.
A Pfopen_fn takes arguments (source, mode) and opens file source in the specified mode and returns an SFIO stream on success or NULL on failure.
typedef Sfio_t* (*Pfopen_fn)(const char *source, const char *mode);
It should normally have the same semantics as the call sfopen(NiL, string, mode), except that for the following string constants it should return an existing SFIO stream:
"/dev/stdin" → sfstdin
"/dev/stdout" → sfstdout
"/dev/stderr" → sfstderr
For "/dev/stdin", mode "r" (read) is expected. For "/dev/stdout" or "/dev/stderr", mode "a" (append-only) is expected. If you use some other mode for these three cases, the function should attempt to apply mode to the specified SFIO stream; it should return NULL if this fails, otherwise the specified stream.
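The shape of such an open function can be sketched with the C standard library standing in for SFIO, so that the example compiles without the Pads or SFIO headers. The names demo_fopen_fn and demo_fopen are hypothetical, FILE* replaces Sfio_t*, and the sketch does not attempt to re-apply an unexpected mode to the three standard streams.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for Pfopen_fn: the real type returns Sfio_t*, not FILE*. */
typedef FILE *(*demo_fopen_fn)(const char *source, const char *mode);

/* Map the three special names to existing streams; open anything else. */
static FILE *demo_fopen(const char *source, const char *mode)
{
    assert(source != NULL && mode != NULL);
    if (strcmp(source, "/dev/stdin") == 0)
        return stdin;
    if (strcmp(source, "/dev/stdout") == 0)
        return stdout;
    if (strcmp(source, "/dev/stderr") == 0)
        return stderr;
    return fopen(source, mode);        /* NULL on failure, as required */
}
```

A real replacement would have the Pfopen_fn signature shown above and would be installed by assigning it to the fopen_fn field of the discipline.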
The field error_fn stores the error reporting function. If this field is NULL, the default function P_error is used. The type for this function is:
typedef int (*Perror_fn)(const char *libnm, int level, ...);
Error functions are a lot like the standard printf function, with two additional arguments, libnm and level. For example, where one might use printf as follows:
printf("count: %d\n", 10);
one can do the same thing with P_error using:
P_error(NULL, P_LEV_INFO, "count: %d", 10);
Note that P_error automatically adds a newline, so we did not have to include a "\n" in the format string, as we did with the printf example.
The first argument to an error function, libnm, is normally NULL (it is for the name of the library calling the error function).
The second argument, level, is used to specify the severity of the error. Level P_LEV_INFO is used for an informative (non-error) message, P_LEV_WARN is used for a warning, P_LEV_ERR for a non-fatal error, and P_LEV_FATAL for a fatal error. One can ’or’ in the following flags (as in P_LEV_WARN|P_FLG_PROMPT):
P_FLG_PROMPT | do not emit a newline
P_FLG_SYSERR | add a description of errno (errno should be a system error)
P_FLG_LIBRARY | error is from library
Library messages are forced if the environment variable ERROR_OPTIONS includes the string “library”; SYSERR (errno) messages are forced if it includes the string “system”.
For convenience, if the first argument, library name libnm, is non-NULL, then flag P_FLG_LIBRARY is automatically or’d into level.
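A user-supplied error function has to pull the format string out of the variadic arguments itself. The following sketch shows that pattern with hypothetical stand-ins throughout: the DEMO_* constants are not the real P_LEV_* and P_FLG_* values, demo_last_msg exists only so the result can be inspected, and returning 0 is an assumption about what callers expect.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the level and flag constants. */
#define DEMO_LEV_INFO   0
#define DEMO_LEV_WARN   1
#define DEMO_LEV_MASK   0x00ff
#define DEMO_FLG_PROMPT 0x0100

/* Last formatted message, kept so callers can inspect it. */
static char demo_last_msg[256];

/* Same shape as Perror_fn: the format string is the first variadic argument. */
static int demo_error(const char *libnm, int level, ...)
{
    char body[200];
    const char *fmt;
    va_list ap;

    va_start(ap, level);
    fmt = va_arg(ap, const char *);
    assert(fmt != NULL);
    vsnprintf(body, sizeof body, fmt, ap);
    va_end(ap);

    /* Optional library prefix, optional severity tag, body, optional newline. */
    snprintf(demo_last_msg, sizeof demo_last_msg, "%s%s%s%s%s",
             libnm ? libnm : "", libnm ? ": " : "",
             (level & DEMO_LEV_MASK) == DEMO_LEV_WARN ? "warning: " : "",
             body,
             (level & DEMO_FLG_PROMPT) ? "" : "\n");
    fputs(demo_last_msg, stderr);
    return 0;
}
```

Calling demo_error(NULL, DEMO_LEV_INFO, "count: %d", 10) mirrors the P_error call shown earlier: the newline is added by the function rather than by the format string.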
The field d_endian stores the endianness of the data. It can have the value PbigEndian or PlittleEndian. If d_endian does not equal the endianness of the machine running the parsing code, the byte order of binary integers is swapped by the binary integer read function.
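The swap rule can be illustrated for a 32-bit unsigned integer. This is a minimal sketch rather than the library's read function; demo_endian, demo_host_endian, demo_swap32, and demo_read_uint32 are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for Pendian_t. */
typedef enum { demo_big_endian, demo_little_endian } demo_endian;

/* Detect the byte order of the machine running this code. */
static demo_endian demo_host_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1 ? demo_little_endian : demo_big_endian;
}

/* Reverse the four bytes of v. */
static uint32_t demo_swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Read four raw bytes, swapping only when data and host order differ. */
static uint32_t demo_read_uint32(const unsigned char *bytes, demo_endian d_endian)
{
    uint32_t v;

    assert(bytes != NULL);
    memcpy(&v, bytes, sizeof v);
    if (d_endian != demo_host_endian())
        v = demo_swap32(v);
    return v;
}
```

Reading the bytes 01 02 03 04 as big-endian data gives 0x01020304 on any host, because the swap is applied only when the two byte orders disagree.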
The fields acc_max2track, acc_max2rep, and acc_pcnt2rep allow the user to customize accumulator behavior. Section 16.2 describes these fields in detail.
The field in_time_zone specifies the default time zone for string-based time input, used for input date strings that do not have time zone information in them. For example, the date 15/Oct/1997:18:46:51 has no time zone information. If in_time_zone is set to "UTC", then this date/time would be assumed to be a UTC time. In contrast, regardless of the in_time_zone setting, the date 15/Oct/1997:18:46:51 -0700 will always be interpreted as being in a time zone seven hours earlier than UTC time.
The in_time_zone is passed to the tmzone function, so it must be a time zone description that tmzone understands. Intuitively, it accepts three-letter strings such as "PST" and "EDT" as well as strings denoting numeric offsets from UTC time, such as "-0500". Chapter 14 describes the legal time zone designation strings. Documentation for the tmzone function appears on the web page: www.research.att.com/gsf/man/man3/tm.html
Before calling P_open, the discipline field disc->in_time_zone can be initialized directly. After calling P_open, however, it must be changed by passing the pads handle and a time zone string to P_set_in_time_zone, e.g.,
P_set_in_time_zone(pads,"PST");
This will set pads->disc->in_time_zone, and will also update an internal representation of the time zone maintained as part of the pads state.
The out_time_zone field specifies the output time zone for formatted time output. Regardless of the time zone used to read in a time, disc->out_time_zone controls which time zone is used when formatting the time for output. For example, a time that is read as 6am UTC time would be formatted as 1am if out_time_zone is "-0500". Note that in the normal case you should use the same time zone for both input and output, unless you are intentionally translating times from one time zone to another. The format of output time zone specification strings is the same as for the input time zone. Chapter 14 describes the legal time zone designation strings.
Before calling P_open, the discipline field disc->out_time_zone can be initialized directly. After calling P_open, however, it must be changed by passing the pads handle and a time zone string to P_set_output_time_zone, e.g.,
P_set_output_time_zone(pads,"CDT");
This will set pads->disc->out_time_zone, and will also update an internal representation of the output time zone maintained as part of the pads state.
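The 6am-to-1am example for an output time zone of "-0500" is plain offset arithmetic, sketched below. The helpers demo_parse_offset and demo_local_minute_of_day are hypothetical; they handle only numeric "+hhmm" and "-hhmm" strings, not zone names such as "PST", and they are not how tmzone is implemented.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Parse "+hhmm" or "-hhmm" into a signed number of minutes.
 * Returns 0 on success, -1 if tz is not a numeric offset. */
static int demo_parse_offset(const char *tz, int *minutes)
{
    int i, hh, mm;

    assert(tz != NULL && minutes != NULL);
    if (strlen(tz) != 5 || (tz[0] != '+' && tz[0] != '-'))
        return -1;
    for (i = 1; i < 5; i++)
        if (!isdigit((unsigned char)tz[i]))
            return -1;
    hh = (tz[1] - '0') * 10 + (tz[2] - '0');
    mm = (tz[3] - '0') * 10 + (tz[4] - '0');
    if (mm > 59)
        return -1;
    *minutes = (tz[0] == '-' ? -1 : 1) * (hh * 60 + mm);
    return 0;
}

/* Shift a UTC minute-of-day into the zone, wrapping around midnight. */
static int demo_local_minute_of_day(int utc_minute_of_day, int offset_minutes)
{
    int m = (utc_minute_of_day + offset_minutes) % 1440;

    return m < 0 ? m + 1440 : m;
}
```

With an offset of -300 minutes, minute 360 of the UTC day (6am) maps to minute 60 (1am).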
The in_formats field of the discipline allows one to specify default input formats for some special types where there is no 'obvious' default. Figure 15.2 contains the type of this field.
typedef struct Pin_formats_s {
const char *timestamp;
const char *date;
const char *time;
} Pin_formats_t;
The current entries are:
timestamp: "%m%d%y+%H%M%S%|%m%d%y+%H:%M:%S%|%m%d%Y+%H%M%S%|%m%d%Y+%H:%M:%S%|%&"
allows for timestamps of these forms:
091172+230202
091172+23:02:02
09111972+230202
09111972+23:02:02
and also allows for all date/time formats parsed by tmdate. Documentation for the tmdate function appears on the web page: www.research.att.com/gsf/man/man3/tm.html
date: "%m%d%y%|%m%d%Y%|%&"
allows for dates of these two forms:
091172
09111972
and also allows for all date formats parsed by tmdate.
time: "%H%M%S%|%H:%M:%S%|%&"
allows for times of these two forms:
230202
23:02:02
and also allows for all time formats parsed by tmdate.
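Judging from how the example forms line up with the strings, each default input format is a list of alternatives separated by "%|", with a trailing "%&" that hands the input to tmdate. Under that reading, the alternatives can be pulled apart as sketched below. The helper demo_format_alt is hypothetical and deliberately naive: it does not treat an escaped percent sign specially.

```c
#include <assert.h>
#include <string.h>

/* Copy the index-th "%|"-separated alternative of fmt into out.
 * Returns 0 on success, -1 if there is no such alternative or
 * out is too small. Naive: ignores escaped percent signs. */
static int demo_format_alt(const char *fmt, int index, char *out, size_t out_size)
{
    const char *start = fmt;
    const char *sep;
    size_t len;

    assert(fmt != NULL && out != NULL && index >= 0);
    while (index-- > 0) {              /* skip the earlier alternatives */
        sep = strstr(start, "%|");
        if (sep == NULL)
            return -1;
        start = sep + 2;
    }
    sep = strstr(start, "%|");
    len = sep ? (size_t)(sep - start) : strlen(start);
    if (len + 1 > out_size)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Applied to "%m%d%y%|%m%d%Y%|%&", the alternatives are "%m%d%y", "%m%d%Y", and "%&".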
The out_formats field of the discipline allows one to specify default output formats for some special types where there is no 'obvious' default. Figure 15.3 contains the type of this field.
typedef struct Pout_formats_s {
const char *timestamp_explicit;
const char *timestamp;
const char *date_explicit;
const char *date;
const char *time_explicit;
const char *time;
} Pout_formats_t;
The current entries are:
Timestamp formats:
"%Y%m%d|%H%M%S"
"%m/%d/%Y %H:%M"
"%K" /* default; %K is the same as "%Y-%m-%d+%H:%M:%S" */
Date formats:
"%Y%m%d"
"%m/%d/%Y"
"%Y-%m-%d" /* default */
Time formats:
"%H%M%S"
"%H.%M"
"%H:%M:%S" /* default */
Write functions take a parse descriptor and a value. The value is valid if the parse descriptor’s errCode is set to P_NO_ERR. The value has been filled in if the errCode is P_USER_CONSTRAINT_VIOLATION. For other errCodes, the value should be assumed to contain garbage. For invalid values, the write function must still write SOME value. For every type, one can specify an inv_val helper function that produces an invalid value for the type, to be used by the type’s write functions. If no function is specified, then a default invalid value is used, where there are two cases: if the errCode is P_USER_CONSTRAINT_VIOLATION, then the current invalid value is used; for any other errCode, a default invalid value is used.
The map from write functions to inv_val functions is found in the discipline in the field inv_val_fn_map, which maps values of type const char * (the string form of the type name) to Pinv_val_fn functions. If the inv_val_fn_map field is NULL, no mappings are used.
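An invalid value function that handles type T values always takes four arguments: the Pads handle (P_t *), a pointer to the parse descriptor for the value (passed as void *), a pointer to the invalid value of type T (passed as void *), and a va_list holding any type arguments.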
Arguments two and three use void* types to enable the table to be used with arbitrary types, including user-defined types. One must cast these void* args to the appropriate pointer types before use; see the example below. The function should replace the invalid value with a new value, and return P_OK on success and P_ERR if a replacement value has not been set.
Use P_set_inv_val_fn to set a function pointer, P_get_inv_val_fn to do a lookup:
Pinv_val_fn P_get_inv_val_fn(P_t* pads, Pinv_val_fn_map_t *map,
const char *type_name);
Pinv_val_fn P_set_inv_val_fn(P_t* pads, Pinv_val_fn_map_t *map,
const char *type_name, Pinv_val_fn fn);
Example: suppose a Pa_int32 field has an attached constraint that requires the value to be at least negative thirty. If a value of negative fifty is read, errCode will be P_USER_CONSTRAINT_VIOLATION. If no inv_val function has been provided, then the Pa_int32 write function will output -50. If the read function fails to read even a valid integer, the errCode will be P_INVALID_A_NUM, and the Pa_int32 write function will output P_MIN_INT32 (the default invalid value for all int32 write functions). If one wanted to correct all user constraint cases to use value -30, and to use P_MAX_INT32 for other invalid cases, one could provide an inv_val helper function to do so:
Perror_t my_int32_inv_val(P_t *pads, void *pd_void, void *val_void, va_list type_args) {
  /* cast the void* arguments to the types used for Pint32 */
  Pbase_pd *pd = (Pbase_pd*)pd_void;
  Pint32 *val = (Pint32*)val_void;
  if (pd->errCode == P_USER_CONSTRAINT_VIOLATION) {
    (*val) = -30;
  } else {
    (*val) = P_MAX_INT32;
  }
  return P_OK;
}
/*create call only needed if no map installed yet*/
pads->disc->inv_val_fn_map = Pinv_val_fn_map_create(pads);
P_set_inv_val_fn(pads, pads->disc->inv_val_fn_map, "Pint32", my_int32_inv_val);
Note that for a type T with three forms, P_T, Pa_T, and Pe_T, there is only one entry in the inv_val_fn_map, under string "P_T". For example, use "Pint32" to specify an invalid value function for all of these types: Pint32, Pa_int32, and Pe_int32.
An inv_val_fn for a string type should use Pstring_copy, Pstring_cstr_copy, Pstring_share, or Pstring_cstr_share to fill in the value of the Pstring* param.
IO discipline values, which have type Pio_disc_t, control the ’raw’ reading of data from a file or from some other data source. The Pads system provides a collection of functions for generating various IO disciplines, corresponding to various kinds of record structures: new-line terminated, fixed width, IBM-style (initial data indicating size of record, followed by payload), etc. In addition, the discipline indicates if the data source is seekable (a file) or not (a stream).
To use an IO discipline, the user first creates one by invoking a creation function supplied by the Pads system. The resulting IO discipline is then installed by passing it as an argument to either P_open or to P_set_io_disc.
When an IO discipline is no longer needed, the user should unmake it. The function P_io_disc_unmake explicitly deallocates an IO discipline. In addition, the function P_close deallocates the installed IO discipline. The function P_set_io_disc deallocates the previously installed discipline. If desired, an IO discipline can be preserved using P_close_keep_io_disc or P_set_io_disc_keep_old, in which case it can be re-used in a future P_open or P_set_io_disc call.
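The ownership rules above can be modeled in a few lines. The sketch uses hypothetical demo_* types and functions in place of P_t, Pio_disc_t, P_set_io_disc, P_set_io_disc_keep_old, P_close, and P_io_disc_unmake; the real functions have their own signatures, and a counter of live disciplines is added only so the effect of each call can be observed.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int unused; } demo_io_disc;            /* stand-in for Pio_disc_t */
typedef struct { demo_io_disc *io_disc; } demo_handle;  /* stand-in for P_t */

/* Number of disciplines that have been made but not yet unmade. */
static int demo_live_count = 0;

static demo_io_disc *demo_io_disc_make(void)
{
    demo_io_disc *d = malloc(sizeof *d);

    assert(d != NULL);                 /* sketch only: no error recovery */
    demo_live_count++;
    return d;
}

static void demo_io_disc_unmake(demo_io_disc *d)
{
    if (d != NULL) {
        demo_live_count--;
        free(d);
    }
}

/* Install d and deallocate the previously installed discipline. */
static void demo_set_io_disc(demo_handle *h, demo_io_disc *d)
{
    demo_io_disc_unmake(h->io_disc);
    h->io_disc = d;
}

/* Install d but leave the old discipline alive for later re-use. */
static void demo_set_io_disc_keep_old(demo_handle *h, demo_io_disc *d)
{
    h->io_disc = d;
}

/* Closing the handle deallocates whatever is installed. */
static void demo_close(demo_handle *h)
{
    demo_io_disc_unmake(h->io_disc);
    h->io_disc = NULL;
}
```

In this model, a discipline that was swapped out with the keep-old variant remains the caller's responsibility: it must eventually be reinstalled or unmade explicitly.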
Implementations of the standard IO disciplines can be found in libpads/default_io_disc.c. Anyone planning to implement a new IO discipline should consult default_io_disc.c.