Storage Management
Big data pipeline frequently produce large amounts of intermediate data, stored on the filesystem both because it isn’t practical to store it all in RAM and because this allows for a pipeline which fails part way through to be restarted without redoing all the work from scratch. The Martian runtime’s VDR (volatile data removal) feature is intended for removing this intermediate data once it is no longer needed.
Volatile Data Removal
A call to a stage can be marked volatile
by specifying
call STAGE_NAME(
arg1 = value,
) using (
volatile = true,
)
If a stage is marked volatile it is eligible for volatile data removal once all stages which depend on it are complete. When the “VDR killer” is invoked, all data files owned by the stage will be deleted (except those files specified in output parameters of the top-level pipeline), freeing up disk space. Job metadata is retained, and the total amount of freed space is recorded.
In --vdrmode=rolling
, the VDR killer is invoked whenever any stage completes.
In --vdrmode=post
it is invoked when the pipeline completes.
Mrp’s default is --vdrmode=rolling
, however for development purposes, one
may wish to set --vdrmode=disable
to preserve intermediate results.
Additionally, when VDR is not disabled, all stages which split will have their chunks’ files cleaned out by VDR when all dependent stages have completed.
Strict-mode VDR
To enable more aggressive file cleanup, a stage can be marked “strict-mode compatible” when it is declared:
stage STAGE_NAME(
in int value,
out txt summary,
out gz[] archives,
out json index,
) using (
volatile = strict,
)
This is telling the runtime that the stage should not be producing any files
which other stages depend on without explicitly mentioning them in an output.
As a best practice, all new stages should opt in to this behavior. All calls
to stages declared using volatile = strict
are implicitly volatile,
regardless of the state of the volatile
modifier on the call
.
When a stage is marked strict-mode compatible, rather than waiting for all dependent stages to complete and then deleting all of the files (or just the chunk files if the top-level pipeline’s outputs are bound to any of the outs), each of the stage’s output parameters is checked for file paths. Files with paths which are not mentioned in the stage’s outputs are deleted immediately when the stage completes. Other files are deleted when there are no longer any incomplete stages depending on the parameter which mentions that file.
In the example above, imagine a pipeline
pipeline COMPLEX_VDR(
in int value,
out txt output1,
out int output2,
out int output3,
)
{
call STAGE_NAME(
value = self.value,
)
call STAGE_2(
value = STAGE_NAME.archives,
)
call STAGE_3(
value = STAGE_NAME.index,
)
return (
output1 = STAGE_NAME.summary,
output2 = STAGE_2.output,
output3 = STAGE_3.output,
)
}
summary
will be deleted immediately unless COMPLEX_VDR
’s output1
is
bound to another stage’s inputs or the top-level pipeline’s outputs (or if
COMPLEX_VDR
itself is the top-level pipeline). Otherwise it will be once
those stages have completed, or never if it’s bound to the top-level outputs.
deleted. The files mentioned in archives
will be deleted as soon as
STAGE_2
completes successfully, and index
will be deleted as soon as
STAGE_3
completes successfully. Note that this means that index
should not
contain paths to the files listed in archives
, because those files have
different lifetimes.
Retained outputs
Frequently during debugging, and occasionally in other circumstances, it is desirable to preserve a file after a pipeline completes even if it is not part of the formal outputs, for example if one wants to later rerun a subset of the pipeline which depends on that file, or if one wants to be able to access a more “raw” form of the output.
“Retained” outputs are treated from a VDR perspective as if they were bound to the top-level pipeline’s outputs - they are never deleted.
A stage can be declared as retaining some outputs:
stage STAGE_NAME(
in int value,
out txt summary,
out gz[] archives,
out json index,
) using (
volatile = strict,
) retain (
summary,
)
This prevents summary
from ever being deleted by VDR, regardless of whether
it is bound to anything else. This should be used mainly for cases of small
output files which are important for later debugging.
Additionally, pipelines can declare retained parameters, e.g.
pipeline PIPELINE_NAME(
...
)
{
call PIPELINE_1(
...
)
call STAGE_2(
...
PIPELINE_1.output1,
)
return (
...
)
retain (
PIPELINE_1.output1,
)
}
This means that the files mentioned in output1
of PIPELINE_1
will never be
deleted by VDR. This is the preferred method to use during debugging to
preserve outputs which may be required for rerunning a later stage, in this
example STAGE_2
. It is preferred in part because it puts the retain
declaration closer to where the value is being used (and thus more clear about
why it’s being retained) and because the stage to which PIPELINE_1.output1
is
eventually bound might be called in other places where retention is not
required.