Globus CLI Batch Transfer Recipe
November 11, 2019 | Rick Wagner
We’d like to share this walkthrough, recently written up on GitHub, showing how to use the Globus CLI to list, filter, and batch-submit transfers from two locations into a single destination folder.
This content was originally written up as part of our “automation-examples” GitHub repository – look there for the most current version, as well as examples of other ways to simplify data orchestration with Globus.
In this example we're going to submit transfers from two directories on a single Globus endpoint and have the data copied to a single common directory. This can be used to aggregate results from different simulations or other jobs, and it will also show how to do a lot of things with the Globus CLI along the way. This pattern is especially useful when you deal with hundreds or thousands of files and directories at a time.
To get started, you'll need to have the Globus CLI installed and be logged in. See the getting started section of the automation-examples README.
Get the Endpoint UUIDs
We're going to copy data from ALCF's Theta to Petrel, the storage system used to support community data repositories. Globus makes heavy use of UUIDs to refer to things like endpoints, so we'll search for them.
$ globus endpoint search theta
ID | Owner | Display Name
------------------------------------ | ----------------- | --------------
08925f04-569f-11e7-bef8-22000b9a448b | alcf@globusid.org | alcf#dtn_theta
$ globus endpoint search petrel#e3sm
ID | Owner | Display Name
------------------------------------ | ------------------- | ------------
dabdceba-6d04-11e5-ba46-22000b92c6ec | petrel@globusid.org | petrel#e3sm
Set Environment Variables to Track Things
Memorizing UUIDs is not a recommended practice. We'll set environment variables to track them.
$ theta_ep=08925f04-569f-11e7-bef8-22000b9a448b
$ petrel_e3sm_ep=dabdceba-6d04-11e5-ba46-22000b92c6ec
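If you're scripting this, you can also capture an ID without copy-paste. Here's a minimal sketch, assuming the CLI's --format unix and --jmespath output options and that the first search result is the endpoint you want (the Transfer API wraps results in a DATA array):
$ theta_ep=$(globus endpoint search 'alcf#dtn_theta' --format unix --jmespath 'DATA[0].id')
$ echo $theta_ep
08925f04-569f-11e7-bef8-22000b9a448b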
While we're at it, we'll set our source and destination directories to prevent typos and errors.
$ run1_path=/lus/theta-fs0/projects/example/run1/
$ run2_path=/lus/theta-fs0/projects/example/run2/
$ e3sm_path=/users/rick/watertable/
Check Endpoint Activation
If an endpoint isn't activated, go to the Globus web app, search for the endpoint by name or UUID, and you'll be prompted for credentials to activate it. The destination in this example is a shared endpoint, which will be auto-activated by the Globus CLI.
$ globus endpoint is-activated $theta_ep
08925f04-569f-11e7-bef8-22000b9a448b is activated
Exit: 0
$ globus endpoint is-activated $petrel_e3sm_ep
dabdceba-6d04-11e5-ba46-22000b92c6ec does not require activation
Exit: 0
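In a script, the exit code from is-activated lets you trigger activation only when it's needed. A minimal sketch, assuming the CLI's globus endpoint activate --web subcommand, which opens the activation page in your browser:
$ globus endpoint is-activated $theta_ep > /dev/null || globus endpoint activate --web $theta_ep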
List Source Files
The globus ls command works a lot like ls on a POSIX command line, and we can use the --filter option to save ourselves from parsing the full listing.
$ globus ls --filter '~*watertable.h0*' $theta_ep:$run1_path > run1_watertable_files.txt
$ globus ls --filter '~*watertable.h0*' $theta_ep:$run2_path > run2_watertable_files.txt
The batch transfer expects a list of source files and their corresponding destination filenames. In this case those are the same, so each line of our files will look like <source filename> <destination filename> with both fields identical. (If we wanted to move the entire directory this would be a bit easier: we would use a recursive transfer. But we only want to move some of the files from the source directories.)
$ for i in $(cat run1_watertable_files.txt)
do
echo "$i $i"
done > run1_watertable_files_src_dest.txt
$ for i in $(cat run2_watertable_files.txt)
do
echo "$i $i"
done > run2_watertable_files_src_dest.txt
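If you prefer a one-liner, the same pairing can be done with standard awk; a minimal sketch that prints each filename twice per line:
$ awk '{print $0, $0}' run1_watertable_files.txt > run1_watertable_files_src_dest.txt
$ awk '{print $0, $0}' run2_watertable_files.txt > run2_watertable_files_src_dest.txt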
Batch Submit the Transfers
The base Globus CLI transfer command is
$ globus transfer <source ep UUID>:<source path> <destination ep UUID>:<destination path>
The --batch option tells the transfer command to read lines from stdin to build the transfer request. The source and destination paths on those lines are relative to the paths we specify as <source ep UUID>:<source path> and <destination ep UUID>:<destination path>.
You could submit one transfer per file, but then you would have a lot of tasks to monitor, and the underlying Globus Connect servers would not be able to efficiently aggregate the files. In other words, that's too much work and it would be slower.
$ globus transfer --batch $theta_ep:$run1_path $petrel_e3sm_ep:$e3sm_path < run1_watertable_files_src_dest.txt
Message: The transfer has been accepted and a task has been created and queued for execution
Task ID: 1d499566-01ab-11ea-be94-02fcc9cdd752
$ globus transfer --batch $theta_ep:$run2_path $petrel_e3sm_ep:$e3sm_path < run2_watertable_files_src_dest.txt
Message: The transfer has been accepted and a task has been created and queued for execution
Task ID: 15173a2e-01ab-11ea-be94-02fcc9cdd752
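Notice the Label: None in the task details below. Adding --label at submission time makes tasks much easier to pick out later; a small variation on the command above (the label text is just an example):
$ globus transfer --batch --label 'run1 watertable' $theta_ep:$run1_path $petrel_e3sm_ep:$e3sm_path < run1_watertable_files_src_dest.txt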
Check Status on the Transfers
You can monitor the tasks using the web app or with the CLI. Here, I've waited long enough for them to finish. Since this example stayed within Argonne and moved only a few hundred gigabytes, that's not surprising. Your transfer rates may vary.
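If you're scripting, you don't have to poll by hand. A minimal sketch using the CLI's globus task wait subcommand, which blocks until the task finishes and exits nonzero otherwise:
$ globus task wait 1d499566-01ab-11ea-be94-02fcc9cdd752 && echo "run1 transfer done"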
run1 Transfer
$ globus task show 1d499566-01ab-11ea-be94-02fcc9cdd752
Label: None
Task ID: 1d499566-01ab-11ea-be94-02fcc9cdd752
Is Paused: False
Type: TRANSFER
Directories: 0
Files: 121
Status: SUCCEEDED
Request Time: 2019-11-07 22:08:24+00:00
Faults: 0
Total Subtasks: 242
Subtasks Succeeded: 242
Subtasks Pending: 0
Subtasks Retrying: 0
Subtasks Failed: 0
Subtasks Canceled: 0
Subtasks Expired: 0
Completion Time: 2019-11-07 22:09:57+00:00
Source Endpoint: alcf#dtn_theta
Source Endpoint ID: 08925f04-569f-11e7-bef8-22000b9a448b
Destination Endpoint: petrel#e3sm
Destination Endpoint ID: dabdceba-6d04-11e5-ba46-22000b92c6ec
Bytes Transferred: 44631218808
Bytes Per Second: 480727214
run2 Transfer
$ globus task show 15173a2e-01ab-11ea-be94-02fcc9cdd752
Label: None
Task ID: 15173a2e-01ab-11ea-be94-02fcc9cdd752
Is Paused: False
Type: TRANSFER
Directories: 0
Files: 481
Status: SUCCEEDED
Request Time: 2019-11-07 22:08:11+00:00
Faults: 0
Total Subtasks: 962
Subtasks Succeeded: 962
Subtasks Pending: 0
Subtasks Retrying: 0
Subtasks Failed: 0
Subtasks Canceled: 0
Subtasks Expired: 0
Completion Time: 2019-11-07 22:11:27+00:00
Source Endpoint: alcf#dtn_theta
Source Endpoint ID: 08925f04-569f-11e7-bef8-22000b9a448b
Destination Endpoint: petrel#e3sm
Destination Endpoint ID: dabdceba-6d04-11e5-ba46-22000b92c6ec
Bytes Transferred: 177418316088
Bytes Per Second: 901833634
List the New Files
As a quality check on my file lists, I'll compare the number of files now in the common destination to the number of source files.
$ globus ls $petrel_e3sm_ep:$e3sm_path > petrel_files.txt
$ wc petrel_files.txt
601 601 54070 petrel_files.txt
$ wc run2_watertable_files.txt run1_watertable_files.txt
481 481 43270 run2_watertable_files.txt
121 121 10890 run1_watertable_files.txt
602 602 54160 total
Hmmm. Off by one... It turns out there was a common file in both directories. It's worth checking for collisions like that when you copy different things to the same destination. That's not just a Globus concern; the POSIX command line can also be unforgiving. Remember: Windows has a trash can, POSIX has an incinerator.
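One way to catch collisions like that before submitting is to look for names that appear in more than one source list; a minimal sketch with standard POSIX tools:
$ sort run1_watertable_files.txt run2_watertable_files.txt | uniq -d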
Cleaning Up
Speaking of incinerators: as you copy data around, sometimes it's just to stage data for another step of the pipeline, as was the case here. After the consolidated data was processed and moved to its next location, we should remove the intermediate directory.
BE AWARE when using globus delete, especially globus delete -r. It's just like being on the command line; if you have write permissions to the target of that command, it's going away. I'm considering a pull request for the Globus CLI to have a software Easter egg where incinerate is a valid alias for delete.
$ globus delete -r $petrel_e3sm_ep:$e3sm_path
Message: The delete has been accepted and a task has been created and queued for execution
Task ID: 30762c84-01c0-11ea-8a5e-0e35e66293c2
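If you want to double-check the target before a destructive command like this, the CLI's --dry-run flag prints what would be submitted instead of submitting it (assuming your CLI version supports it on delete):
$ globus delete --dry-run -r $petrel_e3sm_ep:$e3sm_path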
This article originally appeared on the Globus GitHub site – check there for updates: https://github.com/globus/automation-examples/blob/master/batch.md#run1-transfer