Storage and Files

Storage of data on Gencove Explorer relies on two main mechanisms:

  1. Local storage
  2. Cloud storage, aka EOS (Explorer Object Storage)

Each Gencove Explorer instance contains its own local storage as you would expect with any virtual machine. However, this HDD storage space is limited, and is intended to be a transient or intermediary store for your data.

Larger persistent data is intended to be stored on EOS cloud storage, which is private to your organization. To that end, the Explorer SDK and CLI provide several mechanisms for managing your data in the cloud.

Local Instance Storage

The directory /home/explorer is your personal local storage area. Any programs, scripts, or data files you work with in Jupyter Lab can be saved here. To ensure optimal system performance, please keep track of your storage usage and manage your files appropriately, offloading any large data files to cloud storage as necessary (described in the following section).

By default, each Explorer instance is allocated 200 GB of local disk space.

Running the command df -h through the Explorer terminal, you will see output similar to the following:

Filesystem      Size  Used Avail Use% Mounted on
overlay         196G  5.0G  181G   3% /
tmpfs            64M     0   64M   0% /dev
/dev/nvme1n1    196G  5.0G  181G   3% /home/explorer
shm              64M     0   64M   0% /dev/shm
/dev/nvme0n1p1   50G  2.6G   48G   6% /opt/ecs/metadata/111
tmpfs           7.7G     0  7.7G   0% /proc/acpi
tmpfs           7.7G     0  7.7G   0% /sys/firmware

While using the Gencove Explorer system, it is important to regularly check how much local storage space you have used. This can easily be retrieved by running df -h /home/explorer in the terminal. If you are nearing your storage limit of 200 GB, consider offloading larger data files to cloud storage, described in the following section.
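
The same check can be scripted from a notebook. The following is a minimal sketch using only the Python standard library; the helper name disk_usage_report and the 80% warning threshold are illustrative assumptions and are not part of the Explorer SDK.

```python
import shutil


def disk_usage_report(path="/home/explorer", warn_fraction=0.8):
    """Return (used_gb, total_gb, near_limit) for the filesystem containing `path`.

    `warn_fraction` is an arbitrary illustrative threshold, not an Explorer setting.
    """
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (in bytes)
    used_gb = usage.used / 1e9
    total_gb = usage.total / 1e9
    # Guard against a zero-sized filesystem before dividing
    near_limit = usage.total > 0 and usage.used / usage.total >= warn_fraction
    return used_gb, total_gb, near_limit
```

Calling disk_usage_report() on an Explorer instance reports usage for /home/explorer; a true third value suggests it may be time to move large files to cloud storage.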

Cloud Storage, EOS

Data files commonly used in genomics applications are often very large (tens or even hundreds of GB), and therefore can be unwieldy to work with.

Your Gencove Explorer instance and Analysis jobs are all configured with access to EOS (Explorer Object Storage) - private cloud object storage which can only be accessed by users in the Gencove Organization you belong to.

EOS is essentially a lightweight wrapper around AWS S3 object storage that aims to simplify access to the appropriate user, organization, and Gencove-wide locations on S3.

EOS locations

EOS URIs always start with e://. There are three top-level namespaces for EOS:

  1. User: e://users/<user-id>/ (and e://users/me/ as shorthand for the current user)
    • read/write for current user, only read for the rest of your organization
  2. Organization: e://org/
    • read/write for entire organization
  3. Gencove: e://gencove/
    • read for entire organization
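
To make the three namespaces concrete, the sketch below classifies an EOS URI by its top-level component. The function eos_namespace is a hypothetical helper written for illustration; it only inspects the string and does not contact EOS or check permissions.

```python
# Top-level EOS path components mapped to the namespaces listed above
_NAMESPACES = {"users": "user", "org": "organization", "gencove": "gencove"}


def eos_namespace(uri):
    """Return 'user', 'organization', or 'gencove' for an e:// URI."""
    prefix = "e://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an EOS URI: {uri!r}")
    # First path component after the scheme, e.g. 'users' in e://users/me/...
    top_level = uri[len(prefix):].split("/", 1)[0]
    if top_level not in _NAMESPACES:
        raise ValueError(f"unknown EOS namespace: {top_level!r}")
    return _NAMESPACES[top_level]
```

For example, eos_namespace("e://users/me/project-1/file.txt") returns "user", while a non-EOS URI such as an s3:// path raises ValueError.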

EOS can be accessed using the Gencove Explorer CLI and Explorer SDK.

EOS via CLI

The Gencove CLI provides AWS CLI equivalents of ls, cp, rm, and sync commands that can be used for Explorer data management. Note that the AWS CLI is a dependency for this functionality. For example:

gencove explorer data cp local_file.txt e://users/me/project-1/file.txt

For ease of use, the gencove explorer data command is also available via the d (for "data") alias. The equivalent d command to the example above would be:

d cp local_file.txt e://users/me/project-1/file.txt

List, ls

Listing files can be accomplished with the ls command:

d ls
d ls e://users/me/project-1/

Copy, cp

Files can be uploaded and downloaded using the cp command:

d cp local_file.txt e://users/me/project-1/file.txt
d cp e://users/me/project-1/file.txt local_file.txt
d cp e://users/me/project-1/file.txt e://users/me/project-2/file.txt

Synchronize, sync

Files can be synced in bulk using the sync command:

d sync /home/explorer/project-1/ e://users/me/project-1/
d sync e://users/me/project-1/ /home/explorer/project-1/
d sync e://users/me/project-1/ e://users/me/project-2/

Delete, rm

Remote files can be deleted using the rm command:

d rm e://users/me/project-1/file.txt
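
When these commands need to be driven from a Python script rather than typed in the terminal, one option is to call the CLI through the standard subprocess module. The sketch below assumes the gencove executable is on the PATH; build_data_command and run_data_command are hypothetical helper names, and only the subcommands named in this section are accepted.

```python
import subprocess

# Subcommands described in this section
DATA_SUBCOMMANDS = {"ls", "cp", "sync", "rm"}


def build_data_command(subcommand, *paths):
    """Build the argv list for `gencove explorer data <subcommand> <paths...>`."""
    if subcommand not in DATA_SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    return ["gencove", "explorer", "data", subcommand, *paths]


def run_data_command(subcommand, *paths):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_data_command(subcommand, *paths), check=True)
```

For example, run_data_command("cp", "local_file.txt", "e://users/me/project-1/file.txt") would issue the same upload as the cp example above. Passing the arguments as a list avoids shell quoting issues with unusual file names.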

EOS via SDK File object

File object

The Explorer SDK comes with an inbuilt abstraction of a File object, which represents a file with a local and remote location. It provides a way to specify and transfer (download and upload) files between local and remote storage.

Neither the local nor the remote location needs to exist when the object is created, which effectively allows working with files in a "lazy" manner. You can work with the abstraction of a file, but the actual upload or download of a potentially very large file does not happen until you explicitly call the corresponding method.

Neither the local nor the remote location needs to be specified for the File object. If left unspecified, the following defaults are assumed:

  1. Local: a random filename in the /tmp directory
  2. Remote: a random key in a temporary EOS location

In this section, we describe the File object at a high level, and describe the various methods available to access and transfer (download and upload) remote files.

💡 Note that “local” in this section refers to your Explorer Instance storage or Explorer Analysis job storage.

Downloading files

Parameters path_e and name

In cases where you would like to retrieve a file that already exists on EOS you can use the path_e parameter to File. The related name parameter to File is shorthand for e://users/me/<name>. For example, both of the File objects below point to the same remote file:

from gencove_explorer.models import File

f1 = File(path_e="e://users/me/project-1/file.txt")
f2 = File(name="project-1/file.txt")

Once a File object that refers to a remote file is created, you can make a local copy via the download() method:

# Copy file from EOS via path_e
f1.download()

# Copy file from EOS via name
f2.download()

# Print path to local copy of files
print(f1.path_local)
print(f2.path_local)

Parameter path_s3

For files that exist at an S3 location, you can use the path_s3 parameter to File:

from gencove_explorer.models import File

f3 = File(path_s3="s3://bucket/path/file.txt")

# Copy file from S3 (you must have necessary IAM permissions)
f3.download()

# Print path to local copy of file
print(f3.path_local)

Relationship between name, path_e, and path_s3

Parameters name, path_e, and path_s3 are directly related when referring to EOS locations as follows:

  1. The name parameter is shorthand for e://users/me/<name>
  2. All path_e URIs (e://...) directly correspond to a location in your organization's S3 bucket

For example:

from gencove_explorer.models import File

f = File(name="project-1/file.txt")
print(f)

results in the following values for name, path_e, and path_s3 in the resulting File object:

File(
    ...
    name="project-1/file.txt",
    path_e="e://users/me/project-1/file.txt",
    path_s3="s3://gencove-explorer-c4b6da6d/users/d3fcaf74-2152-4ed5-9834-9ecb1db4dd6c/files/project-1/file.txt",
    url="https://gencove-explorer-c4b6da6d.s3.amazonaws.com/...",
    ...
)
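
The relationship between name and path_e is a plain string mapping, which the sketch below reproduces outside the SDK. The helpers name_to_path_e and path_e_to_name are hypothetical and exist only to illustrate the shorthand; the mapping from path_e to path_s3 depends on your organization's bucket and is not reproduced here.

```python
# Prefix that the `name` shorthand expands to, as described above
USER_PREFIX = "e://users/me/"


def name_to_path_e(name):
    """Expand a `name` value into the equivalent EOS URI."""
    return USER_PREFIX + name


def path_e_to_name(path_e):
    """Return the `name` shorthand for a URI in the current user's namespace.

    Returns None for URIs outside e://users/me/, which have no `name` equivalent.
    """
    if path_e.startswith(USER_PREFIX):
        return path_e[len(USER_PREFIX):]
    return None
```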

Parameter url

For files that exist on the public Internet, you can provide the URL via the url parameter to File:

from gencove_explorer.models import File

f4 = File(url="https://...")

# Copy file from URL (URL must be publicly accessible)
f4.download()

# Print path to local copy of file
print(f4.path_local)

Parameter path_local

Additionally, you can specify the local destination path and file name for downloading the file:

from gencove_explorer.models import File

# Specify local path when creating the File object
f5 = File(name="project-1/file.txt", path_local="~/file.txt")
f5.download()

# Specify local path when calling .download()
f6 = File(name="project-1/file.txt")
f6.download(path_local="~/file.txt")

# Print path to local copy of files
print(f5.path_local)
print(f6.path_local)

💡 All parameters of the File object must be provided by keyword. Otherwise, the exception File.__init__() takes 1 positional argument but 2 were given will be raised.
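
The behavior is the same as for any Python constructor that accepts keyword-only arguments. The toy class below is a stand-in written for illustration (it is not the SDK's File class, and how File enforces this internally is not shown here); it produces the same kind of TypeError when given a positional argument.

```python
class KeywordOnlyFile:
    """Toy stand-in with a keyword-only constructor; not the SDK's File class."""

    def __init__(self, *, name=None, path_local=None):
        self.name = name
        self.path_local = path_local


def try_positional():
    """Return the TypeError message produced by passing a positional argument."""
    try:
        KeywordOnlyFile("project-1/file.txt")  # wrong: positional argument
    except TypeError as exc:
        return str(exc)
    return None


# Correct: pass the value by keyword
ok = KeywordOnlyFile(name="project-1/file.txt")
```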

💡 Note that downloading files from FTP links via File.download() is not currently supported.

Uploading files

As with downloads, files can be uploaded from local storage to EOS (or S3 more generally):

from gencove_explorer.models import File

f1 = File(
    path_local="~/file.txt",
    path_e="e://users/me/project-1/file.txt"
)
f1.upload()

Analogously to downloading objects, name, path_e, and path_s3 are all valid upload destinations.

Executing files

File objects can be executed as scripts by using the execute() method:

from gencove_explorer.models import File

f = File(path_local="~/script.sh")
f.execute()

Command-line parameters can be passed to execute() as follows:

from gencove_explorer.models import File

f = File(path_local="~/script.sh")
f.execute("parameter1", "parameter2")  # equivalent to: ~/script.sh parameter1 parameter2

In case it is preferable to process the output instead of printing it to the terminal, execute() can be configured to provide the output via its return value by setting capture_output to True:

from gencove_explorer.models import File

f = File(path_local="~/script.sh")
c = f.execute(capture_output=True)

print(c.stdout) # Standard output
print(c.stderr) # Standard error
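
The stdout and stderr attributes resemble those of the CompletedProcess object returned by Python's subprocess.run; whether execute() returns exactly that type is an assumption here. The standard-library sketch below shows the same capture pattern on a small child process, with run_and_capture as a hypothetical helper name.

```python
import subprocess
import sys

# Child program: echoes its arguments to stdout and writes a marker to stderr
_CHILD = "import sys; print('out:', *sys.argv[1:]); print('err', file=sys.stderr)"


def run_and_capture(*args):
    """Run the child program and return its captured output as text."""
    return subprocess.run(
        [sys.executable, "-c", _CHILD, *args],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on non-zero exit
    )
```

The returned object can then be processed like any other value, for example by splitting c.stdout into lines or checking whether c.stderr is empty.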

File.execute() uses /bin/sh as the default interpreter, but any interpreter can be specified by setting the appropriate shebang line in the file. For example, to use Python as the interpreter add the following line to the top of the Python script:

#!/usr/bin/env python

Finally, here is a real-world example of executing a shell script (~/script.sh) on an array of inputs, using the features described in this section:

from gencove_explorer.models import File
from gencove_explorer.analysis import Analysis, InputShared

a = Analysis(
    input=["a","b","c"],
    input_shared=InputShared(
        script=File(path_local="~/script.sh").upload()
    ),
    function=lambda ac: ac.input_shared.script.execute(ac.input),
).run()

Temporary files

It is also possible to use the File object for creating temporary local and/or remote files.

💡 Temporary files have an automatically generated filename that is guaranteed to be unique. The user should not expect that these files are permanently stored. We recommend using them for intermediate analysis results, while named files should be used for inputs and outputs.

To create a temporary remote file, omit the name, path_e, and path_s3 parameters:

from gencove_explorer.models import File

# Create File object with no explicit destination
f = File(path_local="~/file.txt")

# Upload to temporary location
f.upload()

To create a temporary local file, exclude the path_local parameter:

from gencove_explorer.models import File

# Create File object with explicit destination
f = File(path_e="e://users/me/project-1/file.txt")

!echo 123 > {f.path_local}

# Upload the temporary local file to the explicit EOS destination
f.upload()

To create a file that is temporary both locally and remotely, exclude all parameters:

from gencove_explorer.models import File

# Create temporary File object
f = File()

!echo 123 > {f.path_local}

# Upload to temporary location
f.upload()

# Get destination
print(f.path_e)

Generating URLs for remote File objects

It is possible to generate a temporary URL for a file in EOS via the url property.

💡 The URLs generated with this method:

  1. Provide access to the file over the public Internet by anyone who has the URL
  2. Expire after 48 hours

Below is an end-to-end example where we copy a local file to EOS, then obtain a URL for it.

from gencove_explorer.models import File

# Create File object with no explicit destination
f = File(path_local="~/file.txt")

# Upload to temporary location
f.upload()

# Generate and print temporary URL for file
print(f.url)

Sharing files within your organization

Files can be easily shared within your organization by sharing the EOS URI (e://...) of objects.

💡 Files created in your user namespace (e://users/me/) need to be made "shareable" by replacing the me shorthand with your user id. To simplify this process, the Explorer SDK provides a convenient File object attribute named path_e_shareable.

from gencove_explorer.models import File

# Create File object in user namespace
f = File(path_local="~/file.txt", path_e="e://users/me/project-1/file.txt")

# Upload to the EOS location given by path_e
f.upload()

print(f.path_e_shareable)

EOS URIs of files created in the organization namespace (e://org/) can be shared within your organization without modification.
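
The rewrite that makes a user-namespace URI shareable is a simple prefix substitution. The sketch below illustrates it with a hypothetical helper, to_shareable, that takes the user id as an argument; in practice the path_e_shareable attribute does this for you, and its exact implementation is not shown here.

```python
# Shorthand prefix for the current user's namespace
ME_PREFIX = "e://users/me/"


def to_shareable(path_e, user_id):
    """Replace the `me` shorthand with an explicit user id.

    URIs outside e://users/me/ (for example e://org/...) are returned unchanged.
    """
    if path_e.startswith(ME_PREFIX):
        return f"e://users/{user_id}/" + path_e[len(ME_PREFIX):]
    return path_e
```

With the example user id shown earlier in this page, to_shareable("e://users/me/project-1/file.txt", "d3fcaf74-2152-4ed5-9834-9ecb1db4dd6c") yields a URI that resolves to the same file for other members of the organization.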