Working with files¶
Data files commonly used in genomics applications are often very large (tens or even hundreds of GB) and can therefore be unwieldy to work with.
As such, the Explorer SDK comes with a built-in abstraction, the File object, which represents a file that may or may not be located on the machine on which code is being run; i.e., it provides a way to specify and manipulate (both download and upload) remote files.
This effectively allows working with files in a "lazy" manner: the abstraction of a file can be created and passed around, but the actual download of a potentially very large file does not happen until a method is explicitly called.
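The "lazy" behaviour described above can be illustrated with a small standard-library sketch. This is not the SDK's implementation: LazyFile, its transfers counter, and the use of a local directory as a stand-in for remote storage are all hypothetical names chosen for illustration, and caching the local copy after the first call is an assumption of the sketch rather than documented SDK behaviour.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class LazyFile:
    """Hypothetical stand-in for a lazy file handle; not the SDK's File class."""

    def __init__(self, source: Path):
        self.source = Path(source)          # where the "remote" data lives
        self._local: Optional[Path] = None  # set on the first as_local() call
        self.transfers = 0                  # counts how often data was copied

    def as_local(self) -> Path:
        """Copy the data on first use; return the same local path afterwards."""
        if self._local is None:
            dest_dir = Path(tempfile.mkdtemp())
            self._local = Path(shutil.copy(self.source, dest_dir))
            self.transfers += 1
        return self._local


# A temporary directory standing in for remote storage (assumption).
remote_dir = Path(tempfile.mkdtemp())
remote_file = remote_dir / "reads.txt"
remote_file.write_text("ACGT\n")

handle = LazyFile(remote_file)  # creating the handle copies nothing
assert handle.transfers == 0

local_path = handle.as_local()  # the copy happens only here
print(local_path.read_text())
```

The point of the design is that creating and passing around the handle is cheap, so the cost of a transfer is paid only by the code that actually needs the bytes.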
In this section, we will describe the File object at a high level, and describe the various methods available to access and manipulate (read and write) remote files.
The File Object¶
Overview¶
The File object represents a file, typically at a remote location. File objects can be used to transfer files to local storage on a user’s Explorer instance or to temporary Explorer storage on S3.
💡 Note that “local” in this section refers to your Explorer Instance storage.
Accessing a remote file¶
In cases where you would like to retrieve a file that already exists at an accessible URL or S3 location, you can use the url or path_s3 parameters to File.
Once a File object that refers to a remote file is created, you can make a local copy via the as_local() method.
Retrieving remote files from a URL and S3:
from gencove_explorer.models import File
f1 = File(url="https://...")
f2 = File(path_s3="s3://bucket/prefix")
Copying remote files to Explorer storage:
# copy from URL (URL must be publicly accessible)
r1_local = f1.as_local()
# copy file from S3 (you must have necessary IAM permissions)
r2_local = f2.as_local()
# print path to local copy of files
print(r1_local)
print(r2_local)
Additionally, you can specify the destination path (and name) to which to copy the file:
# Copy remote file to ~/file.txt
r1_local = f1.as_local(path_local="~/file.txt")
# Pass force=True to overwrite the destination file if it already exists
r1_local = f1.as_local(path_local="~/file.txt", force=True)
# This will print the full path to local copy of the file
print(r1_local)
Copying local files from an Explorer instance to S3 via File¶
Users can copy local files from the Explorer instance to Explorer S3 storage by supplying both the name and path_local parameters to a new File object and then calling upload(). The File object can then be retrieved later by referring to the original name. Additional details on name can be found in the section "The name parameter" below.
💡 Note that the name value must be unique. This value is used to determine the S3 destination for files uploaded to Explorer S3 storage.
from gencove_explorer.models import File
from pathlib import Path
# Create empty example file
phenotypes = Path("./phenotypic_data.txt")
phenotypes.touch()
# Create File object
phenotypes_file = File(name="phenotypes", path_local=phenotypes)
# Copy the local file object to remote storage
phenotypes_file.upload()
# Retrieve from remote via the name parameter
# e.g. this could be done from within an analysis function
phenotypes_file_remote = File(name="phenotypes")
downloaded_phenotypes = phenotypes_file_remote.as_local()
It is also possible to upload unnamed files to an Explorer S3 temporary storage location. To do this, exclude the name parameter and only set path_local. Example:
from gencove_explorer.models import File
from pathlib import Path
# Create empty example file
phenotypes_temporary = Path("./phenotypes_temporary.txt")
phenotypes_temporary.touch()
# Create File object with no name
phenotypes_file = File(path_local=phenotypes_temporary)
# Upload to temporary location
phenotypes_file.upload()
The name parameter¶
The name parameter is intended to allow users to assign a unique name to files so that they can be uploaded to or retrieved from the user’s Explorer S3 storage.
When creating a File object with only the name parameter set, e.g. File(name="phenotypes"), the following logic is executed:
- Checks if name already exists on the user’s Gencove Explorer S3 storage.
  - If the name exists, the file can be copied to local storage via as_local().
  - If the name does not exist, a temporary local path is generated. This path can then be written to, after which .upload() can be called to copy the file to your Explorer S3 storage. Additional details on .upload() can be found in the SDK Reference documentation.
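The branching described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SDK's code: resolve_name is a hypothetical helper, a local directory stands in for Explorer S3 storage, and shutil.copy stands in for .upload().

```python
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def resolve_name(name: str, storage_dir: Path) -> Tuple[bool, Path]:
    """Mimic the name lookup and return (already_stored, path).

    storage_dir is a local directory standing in for Explorer S3 storage.
    """
    stored = storage_dir / name
    if stored.exists():
        # The name is known: this is the file that could be copied locally.
        return True, stored
    # The name is unknown: hand back a fresh temporary path to write to.
    scratch = Path(tempfile.mkdtemp()) / name
    return False, scratch


storage = Path(tempfile.mkdtemp())  # empty stand-in storage

# First lookup: the name does not exist yet, so we get a scratch path.
found, path = resolve_name("phenotypes", storage)
path.write_text("sample_id\tphenotype\n")
shutil.copy(path, storage / "phenotypes")  # stand-in for .upload()

# Second lookup: the name now resolves to the stored file.
found_again, stored_path = resolve_name("phenotypes", storage)
print(found, found_again)
```

The same name therefore serves two roles: a destination for a file that is about to be written and uploaded, and a lookup key for a file that was uploaded earlier.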
Generating URLs for remote File objects¶
It is possible to generate a URL for a file in Explorer S3 storage via the as_url() method. Here is an end-to-end example where we copy a local file to Explorer S3 storage, then obtain a URL for it.
from gencove_explorer.models import File
from pathlib import Path
# Make dummy file
example_file = Path("./example.txt")
example_file.touch()
# Create File object with no name
f = File(path_local=example_file)
# Copy to temporary location on S3
f.upload()
# Generate URL
file_url = f.as_url()
# Print out URL for file
print(file_url)
Sharing files within the organization¶
Files are uploaded to the user's Explorer storage on S3 by default. However, a file can instead be uploaded to a shared space to which all users from the same organization have access.
To manipulate (both download and upload) shared files, you only have to set org_shared to True.