
Chapter 3  Common features

In this chapter, we describe the Pads features shared by all types. Subsequent chapters describe features particular to individual Pads types.

3.1  Pads types

Each Pads type specifies the external representation of a particular kind of data. Pads base types describe the representations of atomic pieces of data, while structured types specify how compound representations are built from more basic ones. Pads provides a large and extensible collection of base types and a family of type constructors for building structured types: Pstructs for record-like sequences, Punions for alternatives, Parrays for sequences, Penums for fixed collections of strings, and Ptypedefs for refinements of existing types.

Syntactically, a Pads type declaration (p_ty_decl) must have the following form:

p_ty_decl ::= base_ty       (* Chapter 4 *)
            | struct_ty     (* Chapter 5 *)
            | union_ty      (* Chapter 6 *)
            | array_ty      (* Chapter 7 *)
            | enum_ty       (* Chapter 8 *)
            | opt_ty        (* Chapter 9 *)
            | typedef_ty    (* Chapter 10 *)
            | trans_ty      (* Chapter 11 *)
            | try_ty        (* Chapter 12 *)
            | charclass_ty  (* Section 3.14.1 *)

We use the terminal p_ty_name for an identifier bound to one of the above types.

The grammar for each of the above non-terminals is given in the associated chapter.

3.2  Comments

In addition to C and C++ style comments of the form /* ... */ and //..., which may appear anywhere in a Pads description, Pads also supports p_comments, which may appear only in particular locations in the grammar. Syntactically,

p_comment ::= /- text

where text is a new-line terminated sequence of characters. Pads comments are reflected to the generated .h file as documentation. We indicate where such comments may appear in the source as the locations arise in the descriptions of the various Pads features.

3.3  Predicates

Pads descriptions permit the user to supply predicates for validating semantic properties of syntactically correct data. Syntactically, such predicates are arbitrary C expressions of integer type. Predicates that evaluate to false (i.e., 0) imply the data is invalid, while all other values imply the data is valid. Predicates are assumed to be side-effect free.

To allow users to express constraints involving the size and position of physical data, Pads supports p_parsecheck expressions within predicates. Syntactically,

p_parsecheck ::= Pparsecheck(aug_expression)

In the production, aug_expression is an integer-valued C expression that is allowed to refer to special constants providing location information. The precise constants depend upon the context of the Pparsecheck clause in the Pads description, but always include the constant position of type Ppos_t, bound to the current position in the physical source.

For integration with general predicates, Pparsecheck expressions are treated as C expressions with type int.

3.4  Literals

Pads supports C-style character and string literals, referred to in the Pads grammar as char_lit and str_lit, respectively. These literals may contain C character escapes such as \" and \'. Pads also supports regular expression literals, described in more detail in Section 3.14, and the special literal Peor, which denotes the end of a record. We use the non-terminal regexp_lit to refer to regular expressions. Syntactically,

p_coreliteral ::= char_lit | str_lit | Pre regexp_lit | Peor
p_literal ::= p_coreliteral | C_identifier Pfrom (p_coreliteral)

Literals (as opposed to core literals) also support a renaming form. The supplied C identifier gives the programmatic name for the literal, while the core literal supplied in the Pfrom clause describes the on-disk representation. Renaming can be useful when the literal is not a valid C identifier.

3.5  Character Sets

The library discipline contains a field def_charset that indicates the expected character set of the external representation of character and string literals, as well as the external representation of all character and string base types that do not explicitly name a character set. Supported character sets include ASCII (Pcharset_ASCII) and EBCDIC (Pcharset_EBCDIC), where ASCII is the default character set. Section 15.1.3 describes how to set def_charset.

3.6  Parameterization

To reduce the number of necessary type declarations and to permit the format of later portions of the data to depend upon earlier portions, Pads types can be parameterized by values. A common example of the latter use is a data source which first specifies the length of a sequence and then gives the sequence itself. The length is read, stored in the in-memory representation, and then passed to the type that describes the sequence to specify the termination condition for the sequence. Syntactically,

p_actual_list ::= expression | expression , p_actual_list
p_actuals ::= (: p_actual_list :)
p_ty ::= p_ty_name [p_actuals]

where expression is any C expression. Formal parameter lists are similar to C’s, except they are delimited differently:

p_formals ::= (: c_formal_list :)

3.6.1  Example

The formalParamExample type declaration in the following Pads fragment illustrates declaring parameters to Pads types, while the actualParamExample illustrates passing parameters.

Pstruct formalParamExample(:Pint32 limit:){
  Pint32 data  :  data <= limit;
}

Pstruct actualParamExample{
  Pint32 limit;
  formalParamExample(:limit:) nestedData;
}

Type formalParamExample expects its single field data to be less than or equal to a supplied value limit, while the type actualParamExample describes an external representation with two pieces: an integer limit and then an instance of formalParamExample. The value of limit is passed as the actual parameter to the formalParamExample Pstruct.
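The length-then-sequence use of parameters described in Section 3.6 can be illustrated in plain C. The following is only a sketch under stated assumptions: the input form "<count>:<v1>,<v2>,..." and the function read_counted are hypothetical and are not part of Pads or of its generated code. It shows a value read early in the data controlling how much of the later data is consumed.

```c
#include <stdio.h>

/* Hypothetical helper, not a Pads function.
 * Parses text of the assumed form "<count>:<v1>,<v2>,..." into out.
 * The count read first plays the role of an actual parameter: it sets
 * the termination condition for the sequence that follows.
 * Returns the number of values read, or -1 on malformed input. */
int read_counted(const char *text, int *out, int max_out)
{
    int count, i;
    int used = -1;   /* stays -1 if the ':' after the count is missing */

    if (sscanf(text, "%d:%n", &count, &used) != 1 || used < 0) return -1;
    if (count < 0 || count > max_out) return -1;
    text += used;

    for (i = 0; i < count; i++) {
        if (sscanf(text, "%d%n", &out[i], &used) != 1) return -1;
        text += used;
        if (i + 1 < count) {          /* a separator must precede the next value */
            if (*text != ',') return -1;
            text++;
        }
    }
    return count;
}
```

In a Pads description the same dependency is expressed declaratively, by passing the earlier field as an actual parameter to the type that describes the sequence.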

3.7  Precord modifier

The Precord modifier may be used as an annotation on any Pads type, indicating that the type describes a record in the external representation of the data. Pads supports a number of different interpretations of what constitutes a record:

Record type            Description
New-line terminated    End of record marked by new-line character
Fixed-width record     Records contain a specified number of bytes
IBM-style record       Record header indicates size

Section 15.2 describes how to set the appropriate record discipline for a data source.

3.8  Psource modifier

The Psource modifier may be used as an annotation on any Pads type, indicating that the type in question describes the entirety of the external representation of the data.

3.9  Pinclude

The Pinclude statement within a .p file causes the supplied argument to be mirrored to the generated .h file, but not the generated .c file. It is useful for including necessary header files for C code referenced in the Pads description. For example, the following use causes the generated .h file to include the directive #include <rpc/rpc.h>.

Pinclude(:#include <rpc/rpc.h>:)

3.10  Error model

During parsing, Pads read functions detect when the data does not conform to the given specification. Detected violations fall into two classes, which differ in their severity. The less severe of these, semantic errors, are those in which the parser detects a violation of the specified format but does not “lose its place.” A typical example of this kind of error is a violation of a user-specified constraint, such as a requirement that a given field be greater than a threshold value. The more severe class, syntactic errors, involves the parser finding raw data that cannot be reconciled with the physical aspects of the description. Typical examples include failing to find literals required by Pstruct declarations or separators from Parray declarations. We say that a read function enters panic mode when it detects such an error. Read functions set the P_Panic flag in the pstate field of the appropriate parse descriptor when entering panic mode (Section 3.13 describes the general role and structure of parse descriptors).

Pads read functions attempt to recover from panic mode by scanning for possible synchronization points in the data source. For example, if the read function for a Pads type foo annotated with the Precord qualifier enters panic mode, it tries to recover by scanning to the end of the record. If it succeeds in finding the record boundary, it lowers the panic flag in the foo parse descriptor, although the panic flag in the nested parse descriptor for the portion of the data description that caused the panic will still be set. The parse descriptors for all portions of the data description that were skipped during the scanning process will also have the panic flag set. Type-specific information regarding error recovery appears in the corresponding chapter.
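The record-level recovery described above can be sketched in plain C. This is an illustration only, assuming new-line terminated records held in an in-memory buffer. The names SKETCH_PANIC, pd_sketch, and resync_to_record_end are hypothetical stand-ins and do not come from the Pads library, whose read functions work through the installed IO discipline.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parse-state flag; the real P_Panic value is defined by Pads. */
#define SKETCH_PANIC 0x1u

/* Hypothetical stand-in for the pstate field of a parse descriptor. */
typedef struct { unsigned pstate; } pd_sketch;

/* Scan forward from pos for the newline that ends the current record.
 * On success, clear the panic flag and return the index just past the
 * newline. If no record boundary is found, leave the panic flag set
 * and return len. */
size_t resync_to_record_end(const char *buf, size_t len, size_t pos,
                            pd_sketch *pd)
{
    const char *nl;

    if (pos >= len) return len;
    nl = memchr(buf + pos, '\n', len - pos);
    if (nl == NULL) return len;
    pd->pstate &= ~SKETCH_PANIC;
    return (size_t)(nl - buf) + 1;
}
```

Only the descriptor of the record-level type is cleared here; descriptors of nested or skipped pieces would keep their panic flags, as described above.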

3.11  In-memory representations

Each Pads type foo has an associated in-memory representation type of the same name. The structure of this representation depends upon the particular Pads type. In general, these representations fall into two broad categories: static representations, whose size can be computed at library-generation time, and dynamic representations, whose size depends on the data being parsed. Details appear in Section 4.1, Section 5.2.1, Section 6.2.2, Section 7.3.1, Section 8.2.1, Section 10.2.1, and Section 11.2.1.

3.12  Masks

Each Pads type foo has an associated mask type, foo_m. Masks allow the library user to customize operations on portions of the associated data. The structure of the mask for a given Pads type mirrors the structure of the representation type. Details about the structure for various types appear in Section 4.2, Section 5.2.2, Section 6.2.3, Section 7.3.2, Section 8.2.2, Section 10.2.2, and Section 11.2.2. Different operations in the generated library interpret masks differently. Details about how a given operation treats its mask argument appear in the sections describing the operations.

3.13  Parse descriptors

Each Pads type foo has an associated parse descriptor type, foo_pd, coded as a C struct with at least the following four fields:

Field                 Description
Puint32 pstate        Flags that describe the state after parsing the associated value.
PerrCode_t errCode    A code indicating the nature of the first detected error. Appendix A contains a list of all error codes and describes their meanings.
Ploc_t loc            The location in the data source of the first error.
Puint32 nerr          The number of errors detected during parsing of the associated value.

Field pstate contains the following flags:

P_Panic    Set if the parser was in panic mode during the parsing of the associated data. See Section 3.10 for more information.

The Pads library provides a collection of functions (macros actually) for manipulating the parse state field:

void P_PS_init(void *pd);
void P_PS_setPanic(void *pd);
void P_PS_unsetPanic(void *pd);
int  P_PS_isPanic(void *pd);

The loc field records the location of the related data in the source file. A Ppos_t (IO position) has a byte position within the num’th read unit, where the read unit is determined by the IO discipline. A description of the read unit (e.g., "record", "1K Block", etc.) can be obtained using the function P_io_read_unit described in Chapter 14. There is also an offset field which gives the absolute offset of the location within the currently installed IO stream.

typedef struct Ppos_s {
  size_t       byte;
  size_t       num;
  Sfoff_t      offset;
} Ppos_t;

A Ploc_t (IO location) has two positions, b and e, marking the first byte and the last byte where something interesting happened, e.g., a field with an invalid format.

struct Ploc_s { Ppos_t b; Ppos_t e; };

In cases where clearcut boundaries for an error are not known, the parse position where the error was ’found’ is used for both the begin and end positions. In this case, and in some other cases, the end byte is set to one less than the start byte, indicating an error that occurred just before the start byte (as opposed to an error that spans the start byte).

The beginning location for a data item is always filled in. The ending location is only filled in if there is an error.
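The end-byte convention can be made concrete with a small C sketch. The types pos_sketch and loc_sketch below are hypothetical stand-ins that mirror the shape of Ppos_t and Ploc_t (with long in place of Sfoff_t) so that the example compiles without the Pads headers. The helper functions are likewise illustrative, and loc_span assumes that both positions lie in the same read unit.

```c
#include <stddef.h>

/* Hypothetical stand-ins mirroring Ppos_t and Ploc_t; not the Pads types. */
typedef struct { size_t byte; size_t num; long offset; } pos_sketch;
typedef struct { pos_sketch b; pos_sketch e; } loc_sketch;

/* Nonzero when the location marks an error found just before the start
 * byte, i.e. the end byte is one less than the start byte. */
int loc_is_before_start(const loc_sketch *loc)
{
    return loc->b.num == loc->e.num && loc->e.byte + 1 == loc->b.byte;
}

/* Number of bytes spanned by a location within one read unit;
 * 0 for the "just before the start byte" case. */
size_t loc_span(const loc_sketch *loc)
{
    if (loc_is_before_start(loc)) return 0;
    return loc->e.byte - loc->b.byte + 1;
}
```

A caller reporting errors could use such a test to distinguish "something was missing at this point" from "these bytes were malformed".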

Details about how parse descriptors are customized for various Pads types appear in Section 4.3, Section 5.2.3, Section 6.2.4, Section 7.3.3, Section 8.2.3, Section 10.2.3 and Section 11.2.3.

3.14  Regular Expressions

p_regexp_lit ::= "/ regexp /"
p_regexp_expression ::= Pre expression

Pads regular expressions support the full POSIX regex specification (www.opengroup.org/onlinepubs/009695399/) and also support many of the Perl extensions. If you have Perl installed, you can use

> man perlre

to see Perl’s regular expression man page.

A regular expression is specified in a Pads description as a string (a const char*). The first character in the string is the expression delimiter: the next (non-escaped) occurrence of this delimiter marks the end of the regular expression. We typically write our examples using slash (/) as the delimiter, but any delimiter can be used. After the closing delimiter, one can add one or more single-character modifiers which change the normal matching behavior. The modifiers are based on those supported by Perl, and currently include:

Modifier   Meaning
  l   Treat the pattern as a literal. All characters in the pattern are literal characters to be found in the input. There are no operators or special characters.
  i   Do case-insensitive pattern matching.
  x   Extend your pattern’s legibility by permitting whitespace and comments. This modifier tells the regular expression parser to ignore whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The "#" character is also treated as a metacharacter introducing a comment. This also means that if you want real whitespace or "#" characters in the pattern (outside a character class, where they are unaffected by the x modifier), you’ll either have to escape them or encode them using octal or hex escapes. Be careful not to include the pattern delimiter in the comment, because there is no way of knowing you did not intend to close the pattern early.
  ?   Minimal match. Change from the normal maximal left-most match semantics to a minimal left-most match semantics.
  f   First match. Change from the normal maximal left-most match semantics to accepting the first match found. This may be useful for terminating regular expressions where any match is sufficient to trigger termination. For termination, the matched characters are not included in the resulting value, so getting the best set of matching characters may not be necessary.

It is important to note that in normal POSIX regexps, the ’^’ and ’$’ special characters match “beginning of line” and “end of line” respectively, where newline is the line separator character. In contrast, in Pads regular expressions, the ’^’ and ’$’ special characters match “beginning of record” and “end of record” respectively (and thus they only have meaning with the record-based IO disciplines). For this reason, newlines that occur within records or within input data for non-record-based input are treated as normal characters with no special semantics. This means, for example, that the ’.’ special character will match newlines. (In Perl one would use the "/s" modifier to get similar behavior.)

If newlines in your input data mark record boundaries, you should be using one of the nlrec IO disciplines described in Section 15.2, in which case the newlines do not appear in your normal input, so there is no issue of ’.’ matching newlines, and ’$’ and ’^’ will have their normal POSIX behavior.

A regular expression that uses both ’^’ and ’$’ may have problems matching arbitrarily large strings because the implementation divides the input into chunks of a particular size for processing. If the records in a data source are larger than this size, the regular expression will not have access to the entire record for matching. In more detail, the matching code finds a region [begin, end] to match over and determines whether begin is actually the beginning of the record and whether end is actually the end of the record. If begin is not really the beginning of the record, then it disables ’^’, and if end is not really the end of the record, then it disables ’$’. This implementation choice will effectively prevent a successful match if the regular expression uses ’^’ or ’$’ unless the regular expression has alternation with a clause that does not use the disabled ’^’ or ’$’. The scope of matching is controlled by the Pads discipline, as described in Section 15.1.5.

Regular expressions are used for two purposes in Pads, and the matching semantics with respect to the current IO position are different for these two cases.

Within regular expressions, one can write in brackets [] a set of characters to be matched against, or the inverse of such a set:

  [abc]   matches an ’a’, ’b’, or ’c’
  [^abc]   matches any character EXCEPT an ’a’, ’b’, or ’c’

INSIDE of one of these bracket expressions one can include a character class using the syntax [:<classname>:]. For example, the following matches either a letter (’A’ through ’Z’ or ’a’ through ’z’) or a ’0’ or ’1’:

[0[:alpha:]1]

Using character classes is preferable to writing something like this:

[0A-Za-z1]

because the letters A-Z may not occur contiguously in all character set encodings. Note that when you just specify a character class within brackets, you end up with a double set of brackets, as in this pattern representing one or more alpha characters:

/[[:alpha:]]+/

The following are all built-in character classes:

  [:alnum:]   alpha or digit
  [:alpha:]   upper or lower alphabet character
  [:blank:]   space (’ ’) or tab
  [:cntrl:]   control character
  [:digit:]   digit (0 through 9)
  [:graph:]   any printable character except space
  [:lower:]   lower-case letter
  [:print:]   any printable character including space
  [:punct:]   any printable character which is not a space or an alphanumeric character
  [:space:]   a white-space character. Normally this includes: space, form-feed (’\f’), newline (’\n’), carriage return (’\r’), horizontal tab (’\t’), and vertical tab (’\v’)
  [:upper:]   an upper-case letter
  [:word:]   an alphanumeric character or an underscore (’_’)
  [:xdigit:]   a hexadecimal digit (normal digits and A through F)
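The bracket and character-class syntax above is the POSIX one, so it can be tried outside Pads with the standard regcomp and regexec functions from <regex.h>. The sketch below assumes a POSIX system and uses a hypothetical helper, match_prefix_len. It exercises the POSIX matcher, not the Pads matcher, so the Pads-specific modifiers and record semantics described in this section do not apply to it, and [:word:] is avoided because it is not one of the POSIX classes.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the match of pattern at the
 * very start of text, or -1 if the pattern does not match there or fails
 * to compile. Uses POSIX extended regular expressions. */
long match_prefix_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;
    /* regexec reports the leftmost match; accept it only at offset 0. */
    if (regexec(&re, text, 1, m, 0) == 0 && m[0].rm_so == 0)
        len = (long)m[0].rm_eo;
    regfree(&re);
    return len;
}
```

For instance, match_prefix_len("[[:alpha:]]+", "abc123") consumes the three leading letters, while the same pattern applied to "123" yields -1.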

3.14.1  Defining your own character classes

It is possible to define your own character class in a Pads file and then use that class in regular expressions that occur later in the file.

Syntax

charclass_ty ::= Pcharclass identifier { identifier};

In this grammar, the first identifier names the character class, while the second names a predicate function which takes a char as an argument and returns an int as a result. Intuitively, this function returns a non-zero value to indicate that the argument character belongs to the class and a zero to indicate it does not.

For example, the following code defines the foo character class:

/* Member of class foo: the letters 'f' and 'o', plus any decimal digit. */
int is_foo(char c) { return (c == 'f') || (c == 'o') || isdigit((unsigned char)c); }

Pcharclass foo {is_foo};
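To see what membership in such a class means, the predicate can be exercised directly in C. This is a minimal sketch: is_foo repeats the predicate shown above, while foo_prefix_len is a hypothetical helper added only for illustration and is not generated by Pads.

```c
#include <ctype.h>
#include <stddef.h>

/* Same membership test as the is_foo predicate: 'f', 'o', or a digit. */
int is_foo(char c)
{
    return (c == 'f') || (c == 'o') || isdigit((unsigned char)c);
}

/* Hypothetical helper: length of the longest prefix of s that consists
 * only of members of the class. */
size_t foo_prefix_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && is_foo(s[n])) n++;
    return n;
}
```

The cast to unsigned char keeps the call to isdigit well defined when char is signed and the input contains bytes above 127.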

Section 14.1 describes compiling regular expressions.

3.15  Expressions

Expression forms include C expressions, regular expressions, and the special symbol Peor.

p_expression ::= expression | Pre expression | Peor

3.16  Operations

For each Pads type, the generated library contains a collection of functions for manipulating the associated data. For structured types, the Pads compiler generates the functions; for base types, the Pads library provides them. This section describes the common features of such functions. Type-specific information may be found in the appropriate chapter.

All operations take a pointer to an initialized Pads handle as their first parameter. Information about how to manage Pads library handles appears in Chapter 14.

Many of the operations that can fail return a value of type Perror_t to indicate success or failure. This type has two values: P_OK and P_ERR.

3.16.1  Initialization and cleanup functions

For each type foo, the generated library contains initialization and cleanup functions for the associated representation foo and parse descriptor foo_pd types. Each initialization function takes a pointer to allocated space and initializes the space appropriately. Each cleanup function takes a pointer to allocated and initialized space and deallocates any memory allocated by Pads functions. It does not deallocate the space pointed to by the parameter.

Perror_t foo_init (P_t *pads, foo *rep);

Perror_t foo_cleanup (P_t *pads, foo *rep);

Perror_t foo_pd_init (P_t *pads, foo_pd *pd);

Perror_t foo_pd_cleanup (P_t *pads, foo_pd *pd);

Because all masks have statically-known size, the library does not contain initialization and cleanup functions for masks. Instead, it contains a function for setting all nested base masks to a specified value:

void foo_m_init (P_t *pads, foo_m *mask, Pbase_m baseMask);

The function takes two parameters (in addition to the Pads handle): a pointer to allocated space for the mask and a base mask value. Because mask initialization functions cannot fail, the return type is void instead of Perror_t.

3.16.2  Utility functions

Each type foo comes equipped with copy functions for both the in-memory representation and the parse descriptor. Both the source and destination pointers are assumed to point to allocated space. In addition, the source pointers are assumed to point to initialized space.

Perror_t foo_copy (P_t *pads, foo *rep_dst, foo *rep_src);

Perror_t foo_pd_copy (P_t *pads, foo_pd *pd_dst, foo_pd *pd_src);

Each type foo also has a predicate function that returns true if the supplied in-memory representation satisfies all of the non-Pparsecheck constraints.

int foo_verify(P_t *pads, foo *rep);

In addition, each structured type foo has a function foo_genPD that takes as an argument an in-memory representation and calculates a corresponding parse descriptor.

int foo_genPD (P_t *pads, foo *rep, foo_pd *pd);

The function assumes that the out parameter pd has already been allocated and initialized to zero. It sets the error codes and counts based on the semantic predicates found in the description of foo. The location information in the argument pd will not be touched. Future versions of the parse descriptor generation function may compute on-disk sizes. The function returns true if the representation contains no errors and false otherwise. The function takes the pads handle as an argument because it must allocate space for array parse descriptors.

3.16.3  Read function

The read function for a Pads type foo takes as an input parameter a pointer to a mask m and fills in two output parameters: a parse descriptor pd and an in-memory representation rep. If any errors occur during parsing, the function returns P_ERR. Otherwise, it returns P_OK.

Perror_t foo_read (P_t *pads, foo_m *m, foo_pd *pd, foo *rep);

The mask argument allows the library user to specify independently which constraints the parser should check and which portions of the in-memory representation it should fill in. Conceptually, there are three different “knobs” associated with each atomic element of a Pads type:

Flag name     Definition
P_SynCheck    Check syntactic constraints
P_SemCheck    Check semantic constraints
P_Set         Set the in-memory representation

At the base-type level, the mask consists of one or more of the above flags. For structured types, the mask consists of a combination of base masks and the masks associated with nested types, the exact combination of which depends upon the kind of structured type. Note that for a structured type foo, the mask initialization function foo_m_init can be used to initialize all nested masks to the supplied value.

Details about the read functions for particular types may be found in the appropriate chapters.

Flags can be combined using the bit-wise OR operator (vertical bar, or ’|’). For example, one can write P_Set|P_SynCheck|P_SemCheck to combine all three flags. For convenience, pads.h provides the following abbreviations:

#define P_CheckAndSet (P_Set|P_SynCheck|P_SemCheck)
#define P_BothCheck (P_SynCheck|P_SemCheck)
#define P_Ignore no flags set (do as little work as possible to process this field)

The library also provides macros for testing and modifying base read-function flags, which are listed in Figure 3.1.


#define P_Test_Set(m) (m & P_Set)
#define P_Test_SynCheck(m) (m & P_SynCheck)
#define P_Test_SemCheck(m) (m & P_SemCheck)
 
#define P_Test_NotSet(m) (!P_Test_Set(m))
#define P_Test_NotSynCheck(m) (!P_Test_SynCheck(m))
#define P_Test_NotSemCheck(m) (!P_Test_SemCheck(m))
 
#define P_Test_CheckAndSet(m) ((m & P_CheckAndSet) == P_CheckAndSet)
#define P_Test_BothCheck(m) ((m & P_CheckAndSet) == P_BothCheck)
#define P_Test_Ignore(m) ((m & P_CheckAndSet) == P_Ignore)
 
#define P_Test_NotCheckAndSet(m) ((m & P_CheckAndSet) != P_CheckAndSet)
#define P_Test_NotBothCheck(m) ((m & P_CheckAndSet) != P_BothCheck)
#define P_Test_NotIgnore(m) ((m & P_CheckAndSet) != P_Ignore)
 
#define P_Do_Set(m) (m |= P_Set)
#define P_Do_SynCheck(m) (m |= P_SynCheck)
#define P_Do_SemCheck(m) (m |= P_SemCheck)
 
#define P_Dont_Set(m) (m &= (~P_Set))
#define P_Dont_SynCheck(m) (m &= (~P_SynCheck))
#define P_Dont_SemCheck(m) (m &= (~P_SemCheck))
Figure 3.1: Provided macros for setting and testing base read-function flags.
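The flag arithmetic behind the macros in Figure 3.1 can be tried in isolation. In the sketch below every DEMO_ name is hypothetical and the bit values are placeholders chosen for illustration. The real P_Set, P_SynCheck, and P_SemCheck values come from pads.h and may differ.

```c
/* Placeholder bit values for this sketch only; not the values in pads.h. */
#define DEMO_Set         0x1u
#define DEMO_SynCheck    0x2u
#define DEMO_SemCheck    0x4u
#define DEMO_CheckAndSet (DEMO_Set | DEMO_SynCheck | DEMO_SemCheck)
#define DEMO_BothCheck   (DEMO_SynCheck | DEMO_SemCheck)

/* Same shape as the P_Test_* and P_Dont_* macros in Figure 3.1. */
#define DEMO_Test_Set(m)       ((m) & DEMO_Set)
#define DEMO_Test_BothCheck(m) (((m) & DEMO_CheckAndSet) == DEMO_BothCheck)
#define DEMO_Dont_Set(m)       ((m) &= ~DEMO_Set)

/* Start from "check and set", then clear the set bit so that the mask
 * asks for both kinds of checking without filling in the representation. */
unsigned check_only_mask(void)
{
    unsigned m = DEMO_CheckAndSet;
    DEMO_Dont_Set(m);
    return m;
}
```

The combined constants are parenthesized so that an expression such as m & DEMO_CheckAndSet masks against all three bits rather than binding & to the first flag only.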

3.16.4  Write functions

For each Pads type, the generated library provides two functions for writing out the in-memory representation of the type in a format as close as possible to its original form.

ssize_t foo_write2io (P_t *pads, Sfio_t *io, foo_pd *pd, foo *rep);

ssize_t foo_write2buf (P_t *pads, Pbyte *buf, size_t buf_len,
                       int *buf_full, foo_pd *pd, foo *rep);

The first of these functions emits its output to an SFIO file; the second to an in-memory buffer. Information about SFIO may be found from www.research.att.com/sw/tools/sfio . For the buffer version, the parameter buf points to an allocated sequence of bytes and buf_len indicates the size of the buffer. Out parameter buf_full is a boolean which is set if the requested write would have overflowed the buffer. The return value for both of the functions is the number of bytes written, with -1 indicating an error occurred.
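The buffer convention can be illustrated with a stand-alone C sketch. The function sketch_write2buf is hypothetical and is not a Pads function: it writes the decimal text of a single integer and omits the Pads handle, parse descriptor, and representation arguments. Returning -1 when the buffer is too small is an assumption made for this sketch, since the text above only states that buf_full is set in that case.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/* Hypothetical writer that follows the buffer convention described above:
 * returns the number of bytes written; if the text would not fit, sets
 * *buf_full, writes nothing, and (by assumption) returns -1. */
ssize_t sketch_write2buf(char *buf, size_t buf_len, int *buf_full, long value)
{
    char tmp[32];
    int n = snprintf(tmp, sizeof tmp, "%ld", value);

    *buf_full = 0;
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;   /* formatting error */
    if ((size_t)n > buf_len) {                          /* would overflow buf */
        *buf_full = 1;
        return -1;
    }
    memcpy(buf, tmp, (size_t)n);   /* no terminating NUL: raw bytes only */
    return n;
}
```

A caller can use buf_full to tell "retry with a larger buffer" apart from other failures.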

Issues that may cause the written data to differ from the original data include skipping white space (Section 15.1.2) and omitting fields in Pstructs (Section 5.1.6).

Passing the -wnone flag to the Pads compiler suppresses the generation of write functions for structured types.

3.16.5  Additional functions

In addition to the basic functionality described in the preceding sections, Pads provides more advanced features described in later chapters.

Accumulators (Chapter 16)   Provide structures and functions for automatically summarizing data.
Histograms (Chapter 17)     Provide structures and functions for computing histograms.
Clustering (Chapter 18)     Provide structures and functions for clustering data.
Formatting (Chapter 19)     Provide functions for formatting data in forms suitable for inclusion in relational databases.
Xml (Chapter 20)            Provide functions for converting data into canonical Xml form as well as generating a corresponding XSchema.
