Introduction

Purpose

The primary purpose of the code generator is to take an ECL query and convert it into a work unit that is suitable for running by one of the engines.

Aims

The code generator has to do its job accurately. If the code generator does not correctly map the ECL to the workunit it can lead to corrupt data and invalid results. Problems like that can often be very hard and frustrating for the ECL users to track down.

There is also a strong emphasis on generating output that is as good as possible. Eclcc contains many different optimization stages, and is extensible to allow others to be easily added.

Eclcc needs to be able to cope with reasonably large jobs. Queries that contain several megabytes of ECL, generate tens of thousands of activities, and produce tens of megabytes of C++ are routine. These queries need to be processed relatively quickly.

Key ideas

Nearly all the processing of ECL is done using an expression graph. The representation of the expression graph has some particular characteristics:

  • Once the nodes in the expression graph have been created they are NEVER modified.
  • Nodes in the expression graph are ALWAYS commoned up if they are identical.
  • Each node in the expression graph is link counted (see below) to track its lifetime.
  • If a modified graph is required, a new graph is created (sharing nodes from the old one).

The ECL language is a declarative language, and in general is assumed to be pure - i.e. there are no side-effects, expressions can be evaluated lazily and re-evaluating an expression causes no problems. This allows eclcc to transform the graph in lots of interesting ways. (Life is never that simple so there are mechanisms for handling the exceptions.)

From declarative to imperative

One of the main challenges with eclcc is converting the declarative ECL code into imperative C++ code. One key problem is it needs to try to ensure that code is only evaluated when it is required, but that it is also only evaluated once. It isn't always possible to satisfy both constraints - for example a global dataset expression used within a child query. Should it be evaluated once before the activity containing the child query is called, or each time the child query is called? If it is called on demand then it may not be evaluated as efficiently...

This issue complicates many of the optimizations and transformations that are done to the queries. Long term the plan is to allow the engines to support more delayed lazy-evaluation, so that whether something is evaluated is more dynamic rather than static.

Flow of processing

An idealised view of the processing within eclcc involves the following stages:

  • Parse the ECL into an expression graph.
  • Expand out function calls.
  • Normalize the expression graph so it is in a consistent format.
  • Normalize the references to fields within datasets to tie them up with their scopes.
  • Apply various global optimizations.
  • Translate logical operations into the activities that will implement them.
  • Resource and generate the global graph.
  • For each activity, resource, optimize and generate its child graphs.

In practice the progression is not so clear cut. There tends to be some overlap between the different stages, and some of them may occur in slightly different orders. However the order broadly holds.

Working on the code generator

The regression suite

Before any change is accepted for the code generator it is always run against several regression suites to ensure that it doesn't introduce any problems, and that the change has the desired effect. There are several different regression suites:

  • testing/regress/ecl - The run time regression suite.
  • ecl/regress - a compiler regression suite. This contains tests that cannot be run (compile-only examples) and tests that check error reporting.
  • LN private suite - This contains a large selection (>10Gb) of archived queries. They contain proprietary code so unfortunately cannot be released as open source.

The ecl/regress directory contains a script 'regress.sh' that is used for running the regression tests. It should be executed in the directory containing the ecl files. The script generates the C++ code (and workunits) for each of the source files into a target directory, and then executes a comparison program to compare the new results with a previous "golden" reference set.

Before making any changes to the compiler, a reference set should be created by running the regression script and copying the generated files to the reference directory.

Here is a sample command line

~/dev/hpcc/ecl/regress/regress.sh -t /regress/hpcc -e /home/<user>/buildr/Release/bin/eclcc -I /home/<user>/dev/hpcc/ecl/regress/modules -I /home/<user>/dev/hpcc/plugins/javaembed -I /home/<user>/dev/hpcc/plugins/v8embed -c /regress/hpcc.master -d bcompare

(A version of this command resides in a shell script in each of my regression suite directories, with the -t and -c options adapted for each suite.)

For a full list of options execute the script with no parameters, or take a look at the script itself. A couple of useful options are:

  • The script can be run on a single file by using the -q option.
  • The (-e) option selects the path of the eclcc. This is particularly useful when running from the build directory (see below), or using multiple build directories to compare behaviour between different versions.

We strongly recommend using a comparison program which allows rules to be defined to ignore certain differences (e.g., Beyond Compare).

Running directly from the build directory

It is much quicker to run eclcc directly from the build directory, rather than deploying a system and running eclcc from there. To do this you need to configure some options that eclcc requires, e.g. where the include files are found. The options can be set either by setting environment variables or by specifying options in an eclcc.ini file. The following are the names of the different options:


Environment flag        Ini file option
----------------        ---------------
CL_PATH                 compilerPath
ECLCC_LIBRARY_PATH      libraryPath
ECLCC_INCLUDE_PATH      includePath
ECLCC_PLUGIN_PATH       plugins
HPCC_FILEHOOKS_PATH     filehooks
ECLCC_TPL_PATH          templatePath
ECLCC_ECLLIBRARY_PATH   eclLibrariesPath
ECLCC_ECLBUNDLE_PATH    eclBundlesPath

The eclcc.ini can either be a file in the local directory, or specified on the eclcc command line with -specs. Including the settings in a local eclcc.ini file also makes it easy to debug eclcc directly from the build directory within the eclipse environment.
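
For example, a minimal eclcc.ini for running from a build directory might look like the following (the paths are purely illustrative and need adapting to your own source and build trees):

compilerPath=/usr
libraryPath=/home/<user>/build/Release/libs
includePath=/home/<user>/dev/hpcc/rtl/include
plugins=/home/<user>/build/Release/plugins
templatePath=/home/<user>/dev/hpcc/rtl/eclrtpl
eclLibrariesPath=/home/<user>/dev/hpcc/ecllibrary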

Hints and tips

  • Logging

    There is an option for eclcc to output a logging file, and another to specify the level of detail in that logging file. If the detail level is above 500 then the expression tree for the query is output to the logging file after each of the code transformations. The tracing is very useful for tracking down at which stage inconsistencies are introduced in the expression graph, and also for learning how each transformation affects the query.

    The output format defaults to ECL - which is regenerated from the expression tree. (This ECL cannot generally be compiled without editing - partly because it contains extra annotations.) Use either of the following:

    eclcc myfile.ecl --logfile myfile.log --logdetail 999

    regress.sh -q myfile.ecl -l myfile.log

  • -ftraceIR

    There is a debug option (-ftraceIR) that generates an intermediate representation of the expression graph rather than regenerating ECL. The output tends to be less compact and harder to read quickly, but has the advantage of being better structured, and contains more details of the internal representation. ecl/hql/hqlir.cpp contains more details of the format.

  • Adding extra logging into the source code

    If you want to add tracing of expressions at any point in the code generation then adding either of the following calls will include the expression details in the log file:

    dbglogExpr(expr); // regenerate the ecl for an expression. See other functions in ecl/hql/hqlthql.hpp

    EclIR::dbglogIR(expr); // regenerate the IR for an expression. See other functions in ecl/hql/hqlir.hpp

  • Logging while debugging

    If you are debugging inside gdb it is often useful to be able to dump out details of an expression. Calling EclIR::dump_ir(expr); will generate the IR to stdout.

    p EclIR::dump_ir(expr)

    The function can also be used with multiple parameters. Each expression will be dumped out, but common child nodes will only be generated once. This can be very useful when trying to determine the difference between two expressions. The quickest way is to call EclIR::dump_ir(expr1, expr2). The first difference between the expressions will be the expression that follows the first "return".

  • Expression sequence ids.

    Sometimes it can be hard to determine where a particular IHqlExpression node was created. If that is the case, then defining DEBUG_TRACK_INSTANCEID (in ecl/hql/hqlexpr.ipp) will add a unique sequence number to each IHqlExpression that is created. There is also a function checkSeqId() at the start of ecl/hql/hqlexpr.cpp which is called whenever an expression is created, linked, released, etc. Setting a breakpoint in that function can allow you to trace back exactly when and why a particular node was created.

Expressions

Expression Graph representation

The key data structure within eclcc is the graph representation. The design has some key elements.

  • Once a node is created it is never modified.

    Some derived information (e.g., sort order, number of records, unique hash, ...) might be calculated and stored in the class after it has been created, but that doesn't change what the node represents in any way. Some nodes are created in stages - e.g., records, modules. These nodes are marked as fully completed when closeExpr() is called, after which they cannot be modified.

  • Nodes are always commoned up.

    If the same operator has the same arguments and type then there will be a unique IHqlExpression to represent it. This helps ensure that graphs stay as graphs and don't get converted to trees. It also helps with optimizations, and allows code duplicated in two different contexts to be brought together.

  • The nodes are link counted.

    Link counts are used to control the lifetime of the expression objects. Whenever a reference to an expression node is held, its link count is increased, and decreased when no longer required. The node is freed when there are no more references. (This generally works well, but does give us problems with forward references.) A short sketch of the idiom follows this list.

  • The access to the graph is through interfaces.

    The main interfaces are IHqlExpression, IHqlDataset and IHqlScope. They are all defined in hqlexpr.hpp. The aim of the interfaces is to hide the implementation of the expression nodes so they can be restructured and changed without affecting any other code.

  • The expression classes use interfaces and a type field rather than polymorphism. This could be argued to be bad object design... but there are reasons:

    There are more than 500 different possible operators. If a class was created for each of them the system would quickly become unwieldy. Instead there are several different classes which model the different types of expression (dataset/expression/scope).

    The interfaces contain everything needed to create and interrogate an expression tree, but they do not contain functionality for directly processing the graph.

    To avoid some of the shortcomings of type fields there are various mechanisms for accessing derived attributes which avoid interrogating the type field.

  • Memory consumption is critical.

It is not unusual to have 10M or even 100M nodes in memory as a query is being processed. At that scale the memory consumption of each node matters - so great care should be taken when considering increasing the size of the objects. The node classes contain a class hierarchy which is there purely to reduce the memory consumption - not to reflect the functionality. With no memory constraints they wouldn't be there, but removing a single pointer per node can save 1Gb of memory usage for very complex queries.
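
As a sketch of the link-counting idiom mentioned above (using the Owned<> smart-pointer helpers from jlib; OwnedHqlExpr is the variant for IHqlExpression):

OwnedHqlExpr value = createConstant(true);  // factory functions return an already-linked node
OwnedHqlExpr other = LINK(value);           // LINK() adds a reference before the pointer is shared
// Both variables release their link automatically when they go out of scope.
// Because identical nodes are commoned up, createConstant(true) elsewhere in
// the compiler returns this same node with its link count increased.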

IHqlExpression

This is the interface that is used to walk and interrogate the expression graph once it has been created. Some of the main functions are:

  • getOperator() - What does this node represent? It returns a member of the node_operator enumerated type.
  • numChildren() - How many arguments does the node have?
  • queryChild(unsigned n) - What is the nth child? If the argument is out of range it returns NULL.
  • queryType() - The type of this node.
  • queryBody() - Used to skip annotations (see below).
  • queryProperty() - Does this node have a child which is an attribute that matches a given name? (See below for more about attributes.)
  • queryValue() - For a no_constant return the value of the constant. It returns NULL otherwise.
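
For example, a naive walk over a graph using these functions might look like the following (a sketch; ForEachChild is a helper macro from ecl/hql/hqlexpr.hpp). Note that real traversals must remember which nodes they have already visited - see the Transformations section - otherwise the graph is walked as a tree and the cost can become exponential:

static void walk(IHqlExpression * expr)
{
    // ... process expr here ...
    ForEachChild(i, expr)
        walk(expr->queryChild(i));
}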

The nodes in the expression graph are created through factory functions. Some of the expression types have specialised functions - e.g., createDataset, createRow, createDictionary, but scalar expressions and actions are normally created with createValue().

Note: Generally, ownership of the arguments to the createX() functions is assumed to be taken over by the newly created node.
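
For example, a scalar addition node might be created as follows (a sketch; the exact overloads of these factory functions are declared in ecl/hql/hqlexpr.hpp, and the type factories in common/deftype):

// createValue() takes ownership of both the type and the argument nodes:
OwnedHqlExpr sum = createValue(no_add, makeIntType(8, true),
                               createConstant((__int64)1),
                               createConstant((__int64)2));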

The values of the enumeration constants in node_operator are used to calculate "crcs" which are used to check whether the ECL for a query matches, and whether disk and index record formats match. The enumeration contains quite a few legacy entries (no_unusedXXX) which can be reused for new operators (otherwise new operators must be added to the end).

IHqlSimpleScope

This interface is implemented by records, and is used to map names to the fields within the records. If a record contains IFBLOCKs then each of the fields in the ifblock is defined in the IHqlSimpleScope for the containing record.

IHqlScope

Normally obtained by calling IHqlExpression::queryScope(). It is primarily used in the parser to resolve fields from within modules.

The ECL is parsed on demand, so as a symbol is looked up it may cause a cascade of ECL to be compiled. The lookup context (HqlLookupContext) is passed to IHqlScope::lookupSymbol() for several reasons:

  • It contains information about the active repository - the source of the ECL which will be dynamically parsed.
  • It contains caches of expanded functions - to avoid repeating expansion transforms.
  • Some members are used for tracking definitions that are read to build dependency graphs, or archives of submitted queries.

The interface IHqlScope currently has some members that are used for creation; this should be refactored and placed in a different interface.

IHqlDataset

This is normally obtained by calling IHqlExpression::queryDataset(). It has shrunk in size over time, and could quite possibly be folded into IHqlExpression with little pain.

There is a distinction in the code generator between "tables" and "datasets". A table (IHqlDataset::queryTable()) is a dataset operation that defines a new output record. Any operation that has a transform or record defining an output record (e.g., PROJECT, TABLE) is a table, whilst those that don't (e.g., a filter, dedup) are not. There are a few apparent exceptions - e.g., IF. (This is controlled by definesColumnList(), which returns true if the operator is a table.)

Properties and attributes

There are two related but slightly different concepts. An attribute refers to the explicit flags that are added to operators (e.g., LOCAL, KEEP(n), etc., specified in the ECL, or some internal attributes added by the code generator). There are a couple of different functions for creating attributes. createExtraAttribute() should be used by default. createAttribute() is reserved for an attribute that never has any arguments, or for unusual situations where it is important that the arguments are never transformed. They are tested using queryAttribute()/hasAttribute() and represented by nodes of kind no_attr/no_expr_attr.

The term "property" refers to computed information (e.g., record counts) that can be derived from the operator, its arguments and attributes. They are covered in more detail below.

Field references

Fields can be selected from active rows of a dataset in three main ways:

  • Some operators define LEFT/RIGHT to represent an input or processed dataset. Fields from these active rows are referenced with LEFT.<field-name>. Here LEFT or RIGHT is the "selector".

  • Other operators use the input dataset as the selector. E.g., myFile(myFile.id != 0). Here the input dataset is the "selector".

  • Often when the input dataset is used as the selector it can be omitted. E.g., myFile(id != 0). This is implicitly expanded by the parser to the second form.

A reference to a field is always represented in the expression graph as a node of kind no_select (created with createSelectExpr). The first child is the selector, and the second is the field. Needless to say there are some complications...

  • LEFT/RIGHT.

    The problem is that the different uses of LEFT/RIGHT need to be disambiguated since there may be several different uses of LEFT in a query. This is especially true when operations are executed in child queries. LEFT is represented by a node no_left(record, selSeq). Often the record is sufficient to disambiguate the uses, but there are situations where it isn't enough. So in addition no_left has a child which is a selSeq (selector sequence) which is added as a child attribute of the PROJECT or other operator. At parse time it is a function of the input dataset that is later normalized to a unique id to reduce the transformation work.

  • Active datasets. It is slightly more complicated - because the dataset used as the selector can be any upstream dataset up to the nearest table. So the following ECL code is legal:

    x := DATASET(...);
    y := x(x.id != 0);
    z := y(x.id != 100);
    

Here the reference to x.id in the definition of z is referring to a field in the input dataset.

Because of these semantics, the selector in a normalized tree is actually inputDataset->queryNormalizedSelector() rather than inputDataset. This function currently returns the table expression node (but it may change in the future - see below).

Attribute "new"

In some situations ECL allows child datasets to be treated as a dataset without an explicit NORMALIZE. E.g., EXISTS(myDataset.ChildDataset);

This is primarily to enable efficient aggregates on disk files to be generated, but it adds some complications with an expression of the form dataset.childdataset.grandchild. E.g.,:

EXISTS(dataset(EXISTS(dataset.childdataset.grandchild)))

Or:

EXISTS(dataset.childdataset(EXISTS(dataset.childdataset.grandchild)))

In the first example, dataset.childdataset within dataset.childdataset.grandchild is a reference to a dataset that doesn't have an active cursor and needs to be iterated, whilst in the second it refers to an active cursor.

To differentiate between the two, all references to fields within datasets/rows that don't have active selectors have an additional attribute("new") as a child of the select. So a no_select with a "new" attribute requires the dataset to be created, one without is a member of an active dataset cursor.

If you have a nested row, the new attribute is added to the selection from the dataset, rather than the selection from the nested row. The functions queryDatasetCursor() and querySelectorDataset() are used to help interpret the meaning.

(An alternative would be to use a different node from no_select - possibly this should be considered - it would be more space efficient.)

The expression graph generated by the ECL parser doesn't contain any new attributes. These are added as one of the first stages of normalizing the expression graph. Any code that works on normalized expressions needs to take care to interpret no_selects correctly.
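
In code, the distinction can be tested roughly as follows (a sketch; newAtom is one of the well-known atoms declared in ecl/hql):

if ((expr->getOperator() == no_select) && expr->hasAttribute(newAtom))
{
    // a "new" select: the dataset (child 0) must be created and iterated
}
else
{
    // a select from a row of an active dataset cursor
}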

Transforming selects

When an expression graph is transformed and none of the records are changed, the representation of LEFT/RIGHT remains the same. This means any no_select nodes in the expression tree will also stay the same.

However, if the transform modifies a table (highly likely) it means that the selector for the second form of field selector will also change. Unfortunately this means that transforms often cannot be short-circuited.

It could significantly reduce the extent of the graph that needs traversing, and the number of nodes replaced in a transformed graph if this could be avoided. One possibility is to use a different value for dataset->queryNormalizedSelector() using a unique id associated with the table. I think it would be a good long term change, but it would require unique ids (similar to the selSeq) to be added to all table expressions, and correctly preserved by any optimization.

Annotations

Sometimes it is useful to add information into the expression graph (e.g., symbol names, position information) that doesn't change the meaning, but should be preserved. Annotations allow information to be added in this way.

An annotation's implementation of IHqlExpression generally delegates the majority of the methods through to the annotated expression. This means that most code that interrogates the expression graph can ignore their presence, which simplifies the caller significantly. However transforms need to be careful (see below).

Information about the annotation can be obtained by calling IHqlExpression::getAnnotationKind() and IHqlExpression::queryAnnotation().

Associated side-effects

In legacy ECL you will see code like the following:

EXPORT a(x) := FUNCTION
   Y := F(x);
   OUTPUT(Y);
   RETURN G(Y);
END;

The assumption is that whenever a(x) is evaluated the value of Y will be output. However that doesn't particularly fit in with a declarative expression graph. The code generator creates a special node (no_compound) with child(0) as the output action, and child(1) as the value to be evaluated (G(Y)).

If the expression ends up being included in the final query then the action will also be included (via the no_compound). At a later stage the action is migrated to a position in the graph where actions are normally evaluated.

Derived properties

There are many pieces of information that it is useful to know about a node in the expression graph - many of which would be expensive to recompute each time they were required. Eclcc has several mechanisms for caching derived information so it is available efficiently.

  • Boolean flags - getInfoFlags()/getInfoFlags2().

    There are many Boolean attributes of an expression that are useful to know - e.g., is it constant, does it have side-effects, does it reference any fields from a dataset etc. etc. The bulk of these are calculated and stored in a couple of members of the expression class. They are normally retrieved via accessor functions e.g., containsAssertKeyed(IHqlExpression*).

  • Active datasets - gatherTablesUsed().

    It is very common to want to know which datasets an expression references. This information is calculated and cached on demand and accessed via the IHqlExpression::gatherTablesUsed() functions. There are a couple of other functions, IHqlExpression::isIndependentOfScope() and IHqlExpression::usesSelector(), which provide efficient checks for common cases. (A short sketch follows this list.)

  • Information stored in the type.

    Currently datasets contain information about sort order, distribution and grouping as part of the expression type. This information should be accessed through the accessor functions applied to the expression (e.g., isGrouped(expr)). At some point in the future it is planned to move this information as a general derived property (see next).

  • Other derived property.

    There is a mechanism (in hqlattr) for calculating and caching an arbitrary derived property of an expression. It is currently used for the number of rows, location-independent representation, maximum record size, etc. There are typically accessor functions to access the cached information (rather than calling the underlying IHqlExpression::queryAttribute() function).

  • Helper functions.

    Some information doesn't need to be cached because it isn't expensive to calculate, but rather than duplicating the code, a helper function is provided. E.g., queryOriginalRecord() and hasUnknownTransform(). They are not part of the interface because the number would make the interface unwieldy and they can be completely calculated from the public functions.

    However, it can be very hard to find the function you are looking for, and they would greatly benefit from being grouped e.g., into namespaces.
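
As a rough illustration of using this cached information (a sketch; the exact signatures are declared in ecl/hql/hqlexpr.hpp and have varied between versions):

// Can the expression be evaluated without any surrounding dataset context?
if (expr->isIndependentOfScope())
{
    // ...candidate for evaluating globally...
}

// Which datasets does the expression reference?
HqlExprCopyArray inScope;
expr->gatherTablesUsed(inScope);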

Transformations

One of the key processes in eclcc is walking and transforming the expression graphs. Both of these are covered by the term transformations. One of the key things to bear in mind is that you need to walk the expression graph as a graph, not as a tree. If you have already examined a node once you shouldn't repeat the work - otherwise the execution time may be exponential with node depth.

Other things to bear in mind

  • If a node isn't modified don't create a new one - return a link to the old one.
  • You generally need to walk the graph and gather some information before creating a modified graph. Sometimes creating a new graph can be short-circuited if no changes will be required.
  • Sometimes you can be tempted to try and short-circuit transforming part of a graph (e.g., the arguments to a dataset activity), but because of the way references to fields within datasets work, that often doesn't work.
  • If an expression is moved to another place in the graph, you need to be very careful to check if the original context was conditional and that the new context is not.
  • The meaning of expressions can be context dependent. E.g., References to active datasets can be ambiguous.
  • Never walk the expressions as a tree, always as a graph!
  • Be careful with annotations.

It is essential that an expression that is used in different contexts with different annotations (e.g., two different named symbols) is consistently transformed. Otherwise it is possible for a graph to be converted into a tree. E.g.,:

A := x; B := x; C := A + B;

must not be converted to:

A' := x'; B' := x''; C' := A' + B';

For this reason most transformers will check if expr->queryBody() matches expr, and if not will transform the body (the unannotated expression), and then clone any annotations.
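
The pattern typically looks something like the following (a sketch modelled on the transform framework in ecl/hql/hqltrans.ipp; MyTransformer and transformBody are illustrative names, not real classes):

IHqlExpression * MyTransformer::createTransformed(IHqlExpression * expr)
{
    IHqlExpression * body = expr->queryBody(true);
    if (expr != body)
    {
        // Transform the unannotated body, then clone the annotation onto the
        // result, so that differently-annotated uses stay commoned up.
        OwnedHqlExpr newBody = transform(body);
        if (newBody == body)
            return LINK(expr);          // unchanged - share the original node
        return expr->cloneAnnotation(newBody);
    }
    return transformBody(expr);         // per-operator work (hypothetical helper)
}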

Some examples of the work done by transformations are:

  • Constant folding.
  • Expanding function calls.
  • Walking the graph and reporting warnings.
  • Optimizing the order and removing redundant activities.
  • Reducing the fields flowing through the generated graph.
  • Spotting common sub expressions.
  • Calculating the best location to evaluate an expression (e.g., globally instead of in a child query).
  • Many, many others.

Some more details on the individual transforms are given below.

Key Stages

Parsing

The first job of eclcc is to parse the ECL into an expression graph. The source for the ECL can come from various different sources (archive, source files, remote repository). The details are hidden behind the IEclSource/IEclSourceCollection interfaces. The createRepository() function is then used to resolve and parse the various source files on demand.

Several things occur while the ECL is being parsed:

  • Function definitions are expanded inline.

    A slightly unusual behaviour. It means that the expression tree is a fully expanded expression, which is better suited to processing and optimizing.

  • Some limited constant folding occurs.

    When a function is expanded, often it means that some of the test conditions are always true/false. To reduce the transformations the condition may be folded early on.

  • When a symbol is referenced from another module this will recursively cause the ECL for that module (or definition within that module) to be parsed.

  • Currently the semantic checking is done as the ECL is parsed.

    If we are going to fully support template functions and delayed expansion of functions this will probably have to change so that a syntax tree is built first, and then the semantic checking is done later.

Normalizing

There are various problems with the expression graph that comes out of the parser:

  • Records can have values as children (e.g., { myField := infield.value} ), but it causes chaos if record definitions can change while other transformations are going on. So the normalization removes values from fields.

  • Some activities use records to define the values that output records should contain (e.g., TABLE). These are now converted to another form (e.g., no_newusertable).

  • Sometimes expressions have multiple definition names. Symbols and annotations are rationalized and commoned up to aid commoning up other expressions.

  • Some PATTERN definitions are recursive by name. They are resolved to a form that works if all symbols are removed.

  • The CASE/MAP representation for a dataset/action is awkward for the transforms to process. They are converted to nested Ifs.

    (At some point a different representation might be a good idea.)

  • EVALUATE is a weird syntax. Instances are replaced with equivalent code which is much easier to subsequently process.

  • The datasets used in index definitions are primarily there to provide details of the fields. The dataset itself may be very complex and may not actually be used. The dataset input to an index is replaced with a dummy "null" dataset to avoid unnecessary graph transforming, and avoid introducing any additional incorrect dependencies.

Scope checking

Generally if you use LEFT/RIGHT then the input rows are going to be available wherever they are used. However if they are passed into a function, and that function uses them inside a definition marked as global then that is invalid (since by definition global expressions don't have any context).

Similarly if you use syntax <dataset>.<field>, its validity and meaning depends on whether <dataset> is active. The scope transformer ensures that all references to fields are legal, and adds a "new" attribute to any no_selects where it is necessary.

Constant folding: foldHqlExpression

This transform simplifies the expression tree. Its aim is to simplify scalar expressions, and dataset expressions that are valid whether or not the nodes are shared. Some examples are:

  • 1 + 2 => 3 and any other operation on scalar constants.
  • IF(true, x, y) => x
  • COUNT(<empty-dataset>) => 0
  • IF (a = b, 'c', 'd') = 'd' => IF(a=b, false, true) => a != b
  • Simplifying sorts, projects and filters on empty datasets.

Most of the optimizations are fairly standard, but a few have been added to cover more esoteric examples which have occurred in queries over the years.

This transform also supports the option to percolate constants through the graph. E.g., if a project assigns the value 3 to a field, it can substitute the value 3 wherever that field is used in subsequent activities. This can often lead to further opportunities for constant folding (and removing fields in the implicit project).
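
Programmatically the transform is exposed through an entry point in ecl/hql/hqlfold.hpp; usage is roughly as follows (a sketch - some overloads also take an error receiver and folding options):

// Build 1 + 2 and fold it down to the constant 3:
OwnedHqlExpr sum = createValue(no_add, makeIntType(8, true),
                               createConstant((__int64)1),
                               createConstant((__int64)2));
OwnedHqlExpr folded = foldHqlExpression(sum);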

Expression optimizer: optimizeHqlExpression

This transformer is used to simplify, combine and reorder dataset expressions. The transformer takes care to count the number of times each expression is used, to ensure that none of the transformations cause duplication. E.g., swapping a filter with a sort is a good idea, but if two different filters are applied to the same sort and both are swapped, the sort will be duplicated.

Some examples of the optimizations include:

  • COUNT(SORT(x)) => COUNT(x)
  • Moving filters over projects, joins, sorts.
  • Combining adjacent projects, projects and joins.
  • Removing redundant sorts or distributes
  • Moving filters from JOINs to their inputs.
  • Combining activities e.g., CHOOSEN(SORT(x)) => TOPN(x)
  • Sometimes moving filters into IFs
  • Expanding out a field selected from a single row dataset.
  • Combining filters and projects into compound disk read operations.

Implicit project: insertImplicitProjects

ECL tends to be written as general purpose definitions which can then be combined. This can lead to potential inefficiencies - e.g., one definition may summarise some data in 20 different ways, this is then used by another definition which only uses a subset of those results. The implicit project transformer tracks the data flow at each point through the expression graph, and removes any fields that are not required.

This often works in combination with the other optimizations. For instance the constant percolation can remove the need for fields, and removing fields can sometimes allow a left outer join to be converted to a project.

Workunits

The code generator ultimately creates workunits. A workunit completely describes a generated query. It consists of two parts. There is an xml component - this contains the workflow information, the various execution graphs, and information about options. It also describes which inputs can be supplied to the query and what results are generated. The other part is the generated shared object, compiled from the generated C++. This contains functions and classes that are used by the engines to execute the queries. Often the xml is compressed and stored as a resource within the shared object, so the shared object contains a complete workunit.

Workflow

The actions in a workunit are divided up into individual workflow items. Details of when each workflow item is executed, and what its dependencies are, are stored in the <Workflow> section of the xml. The generated code also contains a class definition, with a method perform() which is used to execute the actions associated with a particular workflow item. (The class instances are created by calling the exported createProcess() factory function.)

The generated code for an individual workflow item will typically call back into the engine at some point to execute a graph.
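
The generated code has roughly the following shape (illustrative and simplified; the interfaces are declared in rtl/include/eclhelper.hpp):

struct MyEclProcess : public EclProcess {
    virtual int perform(IGlobalCodeContext * gctx, unsigned wfid) {
        ICodeContext * ctx = gctx->queryCodeContext();
        switch (wfid) {
        case 1:
            ctx->executeGraph("graph1", false, 0, NULL); // call back into the engine
            break;
        }
        return 1; // (simplified)
    }
};
extern "C" ECL_API IEclProcess* createProcess() { return new MyEclProcess; }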

Graph

The activity graphs are stored in the xml. The graph contains details of which activities are required, how those activities link together, and what dependencies there are between them. For each activity it might contain the following information:

  • A unique id.
  • The "kind" of the activity (from enum ThorActivityKind in eclhelper.hpp)
  • The ECL that created the activity.
  • Name of the original definition
  • Location (e.g., file, line number) of the original ECL.
  • Information about the record size, number of rows, sort order etc.
  • Hints, which control options for a particular activity (e.g., the number of threads to use while sorting).
  • Record counts and stats once the job has executed.

Each activity in a graph also has a corresponding helper class instance in the generated code. (The name of the class is cAc followed by the activity number, and the exported factory method is fAc followed by the activity number.) These classes implement the interfaces defined in eclhelper.hpp.
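
For example, a sort activity numbered 42 produces something of this shape (illustrative only - the comparison helper that queryCompare() returns is elided, and the skeleton classes live in rtl/include/eclhelper_base.hpp):

struct cAc42 : public CThorSortArg {
    // ...generated ICompare implementation (compare1) elided...
    virtual ICompare * queryCompare() { return &compare1; }
};
extern "C" ECL_API IHThorArg * fAc42() { return new cAc42; }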

The engine uses the information from the xml to produce a graph of activities that need to be executed. It has a general purpose implementation of each activity kind, and it uses the class instance to tailor that general activity to the specific use e.g., what is the filter condition, what fields are set up, what is the sort order?

Inputs and Results

The workunit xml contains details of what inputs can be supplied when that workunit is run. These correspond to STORED definitions in the ECL. The xml also contains the schema for the results that the workunit will generate.

Once an instance of the workunit has been run, the values of the results may be written back into Dali's copy of the workunit so they can be retrieved and displayed.

Generated code

Aims for the generated C++ code:

  • Minimal include dependencies.

    Compile time is an issue - especially for small on-demand queries. To help reduce compile times (and dependencies with the rest of the system) the number of header files included by the generated code is kept to a minimum. In particular references to jlib, boost and icu are kept within the implementation of the runtime functions, and are not included in the public dependencies.

  • Thread-safe.

    It should be possible to use the members of an activity helper from multiple threads without issue. The helpers may contain some context dependent state, so different instances of the helpers are needed for concurrent use from different contexts (e.g., expansions of a graph.)

  • Concise.

    The code should be broadly readable, but the variable names etc. are chosen to generate compact code.

  • Functional.

    Generally the generated code assigns to a variable once, and doesn't modify it afterwards. Some assignments may be conditional, but once the variable is evaluated it isn't updated. (There are of course a few exceptions - e.g., dataset iterators)

Implementation details

First a few pointers to help understand the code within eclcc:

  • It makes extensive use of link counting. You need to understand that concept to get very far.

  • If something is done more than once then that is generally split into a helper function.

    The helper functions aren't generally added to the corresponding interface (e.g., IHqlExpression) because the interface would become bloated. Instead they are added as global functions. The big disadvantage of this approach is they can be hard to find. Even better would be for them to be rationalised and organised into namespaces.

  • The code is generally thread-safe unless there would be a significant performance implication. In general, all the code used by the parser for creating expressions is thread-safe. Expression graph transforms are thread-safe, and can execute in parallel if a constant (NUM_PARALLEL_TRANSFORMS) is increased. The data structures used to represent the generated code are NOT thread-safe.

  • Much of the code generation is structured fairly procedurally, with classes used to process the stages within it.

  • There is a giant "God" class HqlCppTranslator - which could really do with refactoring.

Parser

The eclcc parser uses the standard tools bison and flex to process the ECL and convert it to an expression graph. There are a couple of idiosyncrasies with the way it is implemented.

  • Macros with fully qualified scope.

    Slightly unusually, macros are defined in the same way that other definitions are - in particular you can have references to macros in other modules. This means that there are references to macros within the grammar file (instead of being purely handled by a pre-processor). It also means the lexer keeps an active stack of macros being processed.

  • Attributes on operators.

    Many of the operators have optional attributes (e.g., KEEP, INNER, LOCAL, ...). If these were all reserved words it would remove a significant number of keywords from use as symbols, and could also mean that when a new attribute was added it broke existing code. To avoid this the lexer looks ahead in the parser tables (by following the potential reductions) to see if the token really could come next. If it can't then it isn't reserved as a symbol.

Generated code

As the workunit is created the code generator builds up the generated code and the xml for the workunit. Most of the xml generation is encapsulated within the IWorkUnit interface. The xml for the graphs is created in an IPropertyTree, and added to the workunit as a block.

C++ Output structures

The C++ generation is ultimately controlled by some template files (thortpl.cpp). The templates are plain text and contain references to allow named sections of code to be expanded at particular points.

The code generator builds up some structures in memory for each of those named sections. Once the generation is complete some peephole optimization is applied to the code. This structure is walked to expand each named section of code as required.

The BuildCtx class provides a cursor into that generated C++. It will either be created for a given named section, or more typically from another BuildCtx. It has methods for adding the different types of statements. Some are simple (e.g., addExpr()), whilst some create a compound statement (e.g., addFilter()). The compound statements change the active selector so any new statements are added within that compound statement.

As well as building up a tree of expressions, this data structure also maintains a tree of associations. For instance when a value is evaluated and assigned to a temporary variable, the logical value is associated with that temporary. If the same expression is required later, the association is matched, and the temporary value is used instead of recalculating it. The associations are also used to track the active datasets, classes generated for row-meta information, activity classes etc. etc.
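
A sketch of the idiom (the real class is declared in ecl/hqlcpp; addExpr() and addFilter() are the methods mentioned above):

void example(BuildCtx & ctx, IHqlExpression * cond, IHqlExpression * sideEffect)
{
    BuildCtx subCtx(ctx);        // a new cursor at the same position
    subCtx.addFilter(cond);      // opens an if (cond) { ... } compound statement
    subCtx.addExpr(sideEffect);  // subsequent statements land inside the braces
}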

Activity Helper

Each activity in an expression graph will have an associated class generated in the C++. Each different activity kind expects a helper that implements a particular IHThorArg interface. E.g., a sort activity of kind TAKsort requires a helper that implements IHThorSortArg. The associated factory function is used to create instances of the helper class.

The generated class might take one of two forms:

  • A parameterised version of a library class. These are generated for simple helpers that don't have many variations (e.g., CLibrarySplitArg for TAKsplit), or for special cases that occur very frequently (CLibraryWorkUnitReadArg for internal results).
  • A class derived from a skeleton implementation of that helper (typically CThorXYZ implementing interface IHThorXYZ). The base class has default implementations of some of the functions, and any exceptions are implemented in the derived class.

Meta helper

This is a class that is used by the engines to encapsulate all the information about a single row - e.g., the format that each activity generates. It is an implementation of the IOutputMeta interface. It includes functions to:

  • Return the size of the row.
  • Serialize and deserialize from disk.
  • Destroy and clean up row instances.
  • Convert to xml.
  • Provide information about the contained fields.

Building expressions

The same expression nodes are used for representing expressions in the generated C++ as the original ECL expression graph. It is important to keep track of whether an expression represents untranslated ECL, or the "translated" C++. For instance ECL has 1 based indexes, while C++ is zero based. If you processed the expression x[1] it might get translated to x[0] in C++. Translating it again would incorrectly refer to x[-1].

There are two key classes used while building the C++ for an ECL expression:

CHqlBoundExpr.

This represents a value that has been converted to C++. Depending on the type, one or more of the fields will be filled in.

CHqlBoundTarget.

This represents the target of an assignment - the C++ variable(s) that are going to be assigned the result of evaluating an expression. It is almost always passed as a const parameter to a function, because the target is well-defined and the function needs to update it.

A C++ expression is sometimes converted back to an ECL pseudo-expression by calling getTranslatedExpr(). This creates an expression node of kind no_translated to indicate the child expression has already been converted.
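
Putting these together (a sketch, assuming translator is the active HqlCppTranslator and using the buildExpr() entry point described below):

CHqlBoundExpr bound;
translator.buildExpr(ctx, eclExpr, bound);           // bound now holds the C++ representation
OwnedHqlExpr translated = bound.getTranslatedExpr(); // wrap as no_translated so it is not converted twice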

Scalar expressions

The generation code for expressions has a hierarchy of calls. Each function is there to allow optimal code to be generated - e.g., not creating a temporary variable if none are required. A typical flow might be:

  • buildExpr(ctx, expr, bound).

    Evaluate the ecl expression "expr" and save the C++ representation in the class bound. This might then call through to...

  • buildTempExpr(ctx, expr, bound);

    Create a temporary variable, evaluate expr, and assign the result to that temporary variable... which then calls:

  • buildExprAssign(ctx, target, expr);

    Evaluate the expression, and ensure it is assigned to the C++ target "target".

    The default implementation might be to call buildExpr().

An operator must either be implemented in buildExpr() (calling a function called doBuildExprXXX) or in buildExprAssign() (calling a function called doBuildAssignXXX). Some operators are implemented in both places if there are different implementations that would be more efficient in each context.

Similarly there are several different assignment functions:

  • buildAssign(ctx, <ecl-target>, <ecl-value>);
  • buildExprAssign(ctx, <c++-target>, <ecl-value>);
  • assign(ctx, <C++target>, <c++source>)

The different varieties exist depending on whether the source value or the target has already been translated. (The names could be rationalised!)

Datasets

Most dataset operations are only implemented as activities (e.g., PARSE, DEDUP). If these are used within a transform/filter then eclcc will generate a call to a child query. An activity helper for the appropriate operation will then be generated.

However a subset of the dataset operations can also be evaluated inline without calling a child query. Some examples are filters, projects, and simple aggregation. It removes the overhead of the child query call in the simple cases, and often generates more concise code.

When datasets are evaluated inline there is a similar hierarchy of function calls:

  • buildDatasetAssign(ctx, target, expr);

    Evaluate the dataset expression, and assign it to the target (a builder interface). This may then call....

  • buildIterate(ctx, expr)

    Iterate through each of the rows in the dataset expression in turn. Which may then call...

  • buildDataset(ctx, expr, target, format)

    Build the entire dataset, and return it as a single value.

Some of the operations (e.g., aggregating a filtered dataset) can be done more efficiently by filtering and summing via an iterator than by forcing the filtered dataset to be evaluated first.

Dataset cursors

The interface IHqlCppDatasetCursor allows the code generator to iterate through a dataset, or select a particular element from a dataset. It is used to hide the different representation of datasets, e.g.,

  • Blocked - the rows are in a contiguous block of memory appended one after another.
  • Array - the dataset is represented by an array of pointers to the individual rows.
  • Link counted - similar to array, but each element is also link counted.
  • Nested. Sometimes the cursor may iterate through multiple levels of child datasets.

Generally rows that are serialized (e.g., on disk) are in blocked format, and they are stored as link counted rows in memory.

Field access classes

The IReferenceSelector interface and the classes in hqltcppc[2] provide an interface for getting and setting values within a row of a dataset. They hide the details of the layout - e.g., csv/xml/raw data, and the details of exactly how each type is represented in the row.

Key filepos weirdness

The current implementation of keys in HPCC uses a format with a separate 8-byte integer field which was historically used to store the file position in the original file. Other complications are that the integer fields are stored big-endian, and signed integer values are biased.

This introduces some complication in the way indexes are handled. You will often find that the logical index definition is replaced with a physical index definition, followed by a project to convert it to the logical view. A similar process occurs for disk files to support VIRTUAL(FILEPOSITION) etc.

Source code

The following are the main directories used by the ecl compiler.


Directory         Contents
---------         --------
rtl/eclrtpl       Template text files used to generate the C++ code
rtl/include       Headers that declare interfaces implemented by the generated code
common/deftype    Interfaces and classes for scalar types and values
common/workunit   Code for managing the representation of a work unit
ecl/hql           Classes and interfaces for parsing and representing an ECL expression graph
ecl/hqlcpp        Classes for converting an expression graph to a work unit (and C++)
ecl/eclcc         The executable which ties everything together

Challenges

From declarative to imperative

As mentioned at the start of this document, one of the main challenges with eclcc is converting the declarative ECL code into imperative C++ code. The direction we are heading in is to allow the engines to support more lazy evaluation, so that (in the child-query example above) the global dataset could be evaluated the first time it is used (although that may potentially be much less efficient). This will allow the code generator to relax some of its current assumptions.

There are several example queries which are already producing pathological behaviour from eclcc, causing it to generate C++ functions which are many thousands of lines long.

The parser

Currently the grammar for the parser is too specialised. In particular the separate productions for expression, datasets, actions cause problems - e.g., it is impossible to properly allow sets of datasets to be treated in the same way as other sets.

The semantic checking (and probably semantic interpretation) is done too early. Really the parser should build up a syntax tree, and then disambiguate it and perform the semantic checks on the syntax tree.

The function calls should probably be expanded later than they are. I have tried in the past and hit problems, but I can't remember all the details. Some are related to the semantic checking.
