Schedule data pipeline
Scheduling a data pipeline allows us to run the pipeline automatically at specific intervals. For instance, we can define a pipeline to run at 2 pm every day or 8 am every Monday.
To schedule a pipeline, go to the Load view and click Schedule pipeline.
The Schedule data pipeline window appears.
First, we enter the pipeline name. We can also choose the start date of the pipeline. Click the calendar icon to select a start date.
We can choose the frequency of the pipeline runs: hourly, daily, weekly, or monthly. We can also specify the time at which the pipeline runs.
Tip
When scheduling a pipeline from an S3 folder, the schedule frequency of the pipeline should match how often new micro-batches arrive in the S3 folder. For instance, if an external extraction process automatically uploads a micro-batch every hour, we can set the pipeline schedule to run hourly.
To run the pipeline right after scheduling it, tick Run pipeline now in addition to the scheduled time.
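The frequency options above can be pictured as a recurrence anchored at the start date and time. The following is a minimal sketch in Python, not Apromore's scheduler: the function names `next_run` and `add_months`, the frequency strings, and the choice to clamp monthly runs to the last day of shorter months are all assumptions made for illustration.

```python
import calendar
from datetime import datetime, timedelta

# Hypothetical model of the frequency options; not Apromore's implementation.
FIXED_STEPS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def add_months(start: datetime, months: int) -> datetime:
    """Shift `start` by whole months, clamping the day to the month's length."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return start.replace(year=year, month=month, day=day)


def next_run(start: datetime, frequency: str, now: datetime) -> datetime:
    """Return the first scheduled run at or after `now`."""
    if frequency not in FIXED_STEPS and frequency != "monthly":
        raise ValueError(f"unknown frequency: {frequency}")
    if now <= start:
        return start
    if frequency in FIXED_STEPS:
        step = FIXED_STEPS[frequency]
        # Ceiling division: number of whole steps needed to reach `now`.
        steps = -((start - now) // step)
        return start + steps * step
    # Monthly: walk forward one month at a time from the start date.
    months = 0
    while add_months(start, months) < now:
        months += 1
    return add_months(start, months)
```

For example, with a start of 1 January 2024 at 2 pm and a daily frequency, `next_run(datetime(2024, 1, 1, 14, 0), "daily", datetime(2024, 1, 3, 9, 30))` returns 2 pm on 3 January 2024.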
Loading strategies and eviction periods
The loading strategy is an important parameter when scheduling a pipeline. There are three loading strategies.
Generate a new log every time the pipeline is executed creates a new log on each run without overwriting the existing ones. When we choose this option, we must enter a name for the new logs.
Always append data to the same log adds the new rows to the existing log file. Effectively, each time the pipeline runs, it appends data to the initial log file.
If the schema of the appended log changes at any time, the pipeline run will fail. To prevent this, we can tick Create new log when the schema is altered. Apromore will instead create another log when it notices a change in the log schema.
To prevent the resulting log file from becoming too large, we can optionally specify the data range to retain.
If one month is selected, data older than one month in the previous dataset will be discarded.
Note
If we delete the log file created by previous pipeline runs, a new log file will be created in the next pipeline run.
Overwrite the log every time the pipeline is executed always replaces the existing log file with the output of the pipeline execution. To prevent pipeline run failure due to a change in the schema, we can tick Create new log when the schema is altered.
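To compare the three loading strategies described above, the following minimal Python sketch models logs as an in-memory dictionary. It is an illustration only and does not use any Apromore API: the function `load`, the strategy strings, the `timestamp` field, and the `name_runid` naming of new logs are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

STRATEGIES = ("new", "append", "overwrite")


def schema(rows: list) -> frozenset:
    """Column names of a batch, taken from its first row."""
    return frozenset(rows[0]) if rows else frozenset()


def load(logs: dict, strategy: str, name: str, batch: list, run_id: int, *,
         new_log_on_schema_change: bool = False,
         retain: Optional[timedelta] = None,
         now: Optional[datetime] = None) -> str:
    """Apply one run's output `batch` to `logs`; return the log name written."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")

    if strategy == "new":
        # A fresh log per run; existing logs are left untouched.
        target = f"{name}_{run_id}"  # hypothetical naming scheme
        logs[target] = list(batch)
        return target

    existing = logs.get(name)
    if existing is None:
        # First run, or the log was deleted: a new log is created.
        logs[name] = list(batch)
        return name

    if existing and schema(existing) != schema(batch):
        if not new_log_on_schema_change:
            raise ValueError("schema changed: the pipeline run fails")
        target = f"{name}_{run_id}"  # hypothetical naming scheme
        logs[target] = list(batch)
        return target

    if strategy == "overwrite":
        logs[name] = list(batch)
        return name

    # Append: optionally discard old rows from the previous dataset first.
    kept = existing
    if retain is not None:
        cutoff = (now or datetime.now()) - retain
        kept = [row for row in existing if row["timestamp"] >= cutoff]
    logs[name] = kept + list(batch)
    return name
```

In this sketch, appending with `retain=timedelta(days=30)` drops previously loaded rows older than 30 days before adding the new batch, and a schema change raises an error unless `new_log_on_schema_change` is set, in which case the batch goes to a separate log.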
Lastly, enter the name of the log file that the pipeline will create and specify its path.
Click Schedule to schedule the pipeline. Once scheduled, the pipeline can be managed in the Data pipeline management window.