Working with files

Data files used in genomics applications are often very large (tens or even hundreds of GB) and can therefore be unwieldy to work with. For this reason, the Explorer SDK includes a built-in File abstraction, which represents a file that may or may not be located on the machine where the code is running; that is, it provides a way to specify, download, and upload remote files. This allows files to be handled "lazily": code can work with the File object, while the download of a potentially very large file is deferred until a method is called explicitly.
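To make the "lazy" idea concrete, here is a minimal sketch that uses only the Python standard library. The `LazyFile` class is hypothetical and is not the SDK's File implementation; a file in a temporary directory stands in for the remote object, and a plain copy stands in for the download.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class LazyFile:
    """Hypothetical stand-in for a lazy file handle (not the SDK's File class)."""

    def __init__(self, source: Path):
        self.source = Path(source)           # where the data lives ("remote")
        self._local: Optional[Path] = None   # set on first materialization

    def as_local(self) -> Path:
        """Copy the source to local storage on first call; reuse it afterwards."""
        if self._local is None:
            target_dir = Path(tempfile.mkdtemp())
            self._local = Path(shutil.copy(self.source, target_dir))
        return self._local


# Demo: a small file in a temp directory plays the role of the remote object.
remote = Path(tempfile.mkdtemp()) / "reads.txt"
remote.write_text("ACGT\n")

handle = LazyFile(remote)        # cheap: nothing is copied yet
local_path = handle.as_local()   # the copy happens here
print(local_path.read_text())
```

Creating the handle costs nothing; the expensive step runs only when `as_local()` is called, and repeated calls reuse the same local copy.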

In this section, we will describe the File object at a high level, and describe the various methods available to access and manipulate (read and write) remote files.

The File Object

Overview

The File object represents a file, typically at a remote location. File objects can be used to transfer files to local storage on a user’s Explorer instance or to temporary Explorer storage on S3.

💡 Note that “local” in this section refers to your Explorer Instance storage.

Accessing a remote file

To retrieve a file that already exists at an accessible URL or S3 location, pass the url or path_s3 parameter to File.

Once a File object that refers to a remote file is created, you can make a local copy via the as_local() method.

Retrieving remote files from a URL and S3:

from gencove_explorer.models import File

f1 = File(url="https://...")
f2 = File(path_s3="s3://bucket/prefix")

Copying remote files to Explorer storage:

# copy from URL (URL must be publicly accessible)
r1_local = f1.as_local()

# copy file from S3 (you must have necessary IAM permissions)
r2_local = f2.as_local()

# print path to local copy of files
print(r1_local)
print(r2_local)
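Once a local copy exists, it can be read with ordinary Python I/O. The sketch below assumes that as_local() returns a path-like value, as the print calls above suggest. The `head` helper is hypothetical and not part of the SDK; it reads only the first few lines, which is useful for inspecting large files, and it opens `.gz` files transparently.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path
from typing import List


def head(path, n: int = 5) -> List[str]:
    """Return the first n lines of a text file without reading it whole."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demo with a small throwaway file standing in for a local copy.
demo = Path(tempfile.mkdtemp()) / "variants.txt"
demo.write_text("\n".join(f"row{i}" for i in range(10)) + "\n")
print(head(demo, 3))  # ['row0', 'row1', 'row2']
```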

Additionally, you can specify the destination path (and name) to which to copy the file:

# Copy remote file to ~/file.txt
r1_local = f1.as_local(path_local="~/file.txt")

# Pass force=True to overwrite the destination if it already exists
r1_local = f1.as_local(path_local="~/file.txt", force=True)

# This will print the full path to local copy of the file
print(r1_local)

Copying local files from an Explorer instance to S3 via File

Users can copy local files from the Explorer instance to S3 by supplying both the name and path_local parameters to a new File object. The File object can then be retrieved later by referring to the same name. Additional details on name can be found in the section "The name parameter" below.

💡 Note that the name value must be unique. This value is used to determine the S3 destination for files uploaded to Explorer S3 storage.

from gencove_explorer.models import File
from pathlib import Path

# Create empty example file
phenotypes = Path("./phenotypic_data.txt")
phenotypes.touch()

# Create File object
phenotypes_file = File(name="phenotypes", path_local=phenotypes)

# Copy the local file object to remote storage
phenotypes_file.upload()

# Retrieve from remote via the name parameter
# e.g. this could be done from within an analysis function
phenotypes_file_remote = File(name="phenotypes")
downloaded_phenotypes = phenotypes_file_remote.as_local()

It is also possible to upload unnamed files to an Explorer S3 temporary storage location. To do this, exclude the name parameter and only set path_local. Example:

from gencove_explorer.models import File
from pathlib import Path

# Create empty example file
phenotypes_temporary = Path("./phenotypes_temporary.txt")
phenotypes_temporary.touch()

# Create File object with no name
phenotypes_file = File(path_local=phenotypes_temporary)

# Upload to temporary location
phenotypes_file.upload()

The name parameter

The name parameter lets users assign a unique name to a file so that it can be uploaded to, or retrieved from, the user’s Explorer S3 storage.

When creating a File object with only the name parameter set, e.g.

f = File(name="my_unique_key")

the following logic is executed:

  1. Checks if name already exists on user’s Gencove Explorer S3 storage
  2. If the name exists, it can be copied to local storage via as_local() e.g.

    f_downloaded = f.as_local()
    
  3. If the name does not exist, a temporary local path is generated. This path can then be written to, then .upload() can be called to copy the file to your Explorer S3 storage. Additional details on .upload() can be found in the SDK Reference documentation.

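    # f is the File(name="my_unique_key") object created above
    f_path = f.as_local()
    with open(f_path, "w") as fh:
        fh.write("foobar")
    # upload() is called on the File object, not on the local path
    f.upload()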
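The three steps above can be sketched without the SDK as follows. This is only an illustration of the described logic: the `FAKE_STORAGE` dictionary and the `resolve_local` and `upload` helpers are hypothetical stand-ins for Explorer S3 storage and the File methods, not the SDK's implementation.

```python
import tempfile
from pathlib import Path
from typing import Dict

# Hypothetical in-memory stand-in for a user's Explorer S3 storage: name -> bytes.
FAKE_STORAGE: Dict[str, bytes] = {}


def resolve_local(name: str) -> Path:
    """Return a local path for `name`, pre-filled if the name is already stored."""
    local = Path(tempfile.mkdtemp()) / name
    if name in FAKE_STORAGE:          # steps 1-2: the name exists, copy it down
        local.write_bytes(FAKE_STORAGE[name])
    return local                      # step 3: otherwise a fresh path to write to


def upload(name: str, local: Path) -> None:
    """Store the local file's contents under `name`."""
    FAKE_STORAGE[name] = local.read_bytes()


# First use: the name is unknown, so we get a fresh path to write to.
path = resolve_local("my_unique_key")
path.write_text("foobar")
upload("my_unique_key", path)

# Later use: the same name now resolves to the stored contents.
print(resolve_local("my_unique_key").read_text())  # prints "foobar"
```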
    

Generating URLs for remote File objects

It is possible to generate a URL for a file in Explorer S3 storage via the as_url() method. Here is an end-to-end example in which we copy a local file to Explorer S3 storage and then obtain a URL for it.

from gencove_explorer.models import File
from pathlib import Path

# Make dummy file
example_file = Path("./example.txt")
example_file.touch()

# Create File object with no name
f = File(path_local=example_file)

# Copy to temporary location on S3
f.upload()

# Generate URL
file_url = f.as_url()

# Print out URL for file
print(file_url)
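A URL obtained this way can be handed to any HTTP client. The sketch below streams a URL to disk with the standard library only. The `fetch` helper is hypothetical, and the demo uses a `file://` URL as a stand-in for the value returned by as_url() so that it runs offline. Whether the generated URL expires or needs extra headers is not covered here.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch(url: str, dest: Path) -> Path:
    """Stream the contents at `url` into `dest` in chunks."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Offline demo: a file:// URL stands in for the value returned by as_url().
workdir = Path(tempfile.mkdtemp())
source = workdir / "example.txt"
source.write_text("hello\n")
copy = fetch(source.as_uri(), workdir / "copy.txt")
print(copy.read_text())
```

Streaming with `shutil.copyfileobj` avoids holding the whole file in memory, which matters for files of the size discussed in this section.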

Sharing files within the organization

By default, files are uploaded to the user's Explorer storage on S3. However, a file can instead be uploaded to a shared space to which all users in the same organization have access.

To download or upload shared files, you only have to set org_shared to True.

from gencove_explorer.models import File
from pathlib import Path

# Make dummy file
example_file = Path("./example.txt")
example_file.touch()

# Create File object with no name and org_shared set to True
f = File(path_local=example_file, org_shared=True)

# Copy to temporary location on S3
f.upload()