Storage and Files

Storage of data on Gencove Explorer relies on two main mechanisms:

  1. Local storage
  2. Cloud storage, aka EOS (Explorer Object Storage)

Each Gencove Explorer instance contains its own local storage as you would expect with any virtual machine. However, this SSD storage space is limited, and is intended to be a transient or intermediary store for your data.

Larger persistent data is intended to be stored on EOS cloud storage, which is private to your organization. To that end, we provide several mechanisms through the Explorer SDK and CLI to enable easily managing your data in the cloud.

Local Instance Storage

The directory /home/explorer is your personal local storage area. Any programs, scripts, or data files you work with in Jupyter Lab can be saved here. To ensure optimal system performance, please keep track of your storage usage and manage your files appropriately, offloading any large data files to cloud storage as necessary (described in the following section).

By default, each Explorer instance is allocated 200 GB of local disk space.

Running the command df -h through the Explorer terminal, you will see output similar to the following:

Filesystem      Size  Used Avail Use% Mounted on
overlay         196G  5.0G  181G   3% /
tmpfs            64M     0   64M   0% /dev
/dev/nvme1n1    196G  5.0G  181G   3% /home/explorer
shm              64M     0   64M   0% /dev/shm
/dev/nvme0n1p1   50G  2.6G   48G   6% /opt/ecs/metadata/111
tmpfs           7.7G     0  7.7G   0% /proc/acpi
tmpfs           7.7G     0  7.7G   0% /sys/firmware

While using the Gencove Explorer system, check regularly how much local storage space you have used. This can easily be retrieved by running df -h /home/explorer in the terminal. If you are nearing your storage limit of 200 GB, consider offloading larger data files to cloud storage, described in the following section.
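The same check can be made from a notebook cell. The snippet below is a minimal sketch using only the Python standard library; the function name disk_usage_summary and the 80% warning threshold are illustrative assumptions, not part of the Explorer SDK.

```python
import shutil


def disk_usage_summary(path="/home/explorer", warn_fraction=0.8):
    """Summarize disk usage for the filesystem that contains `path`.

    `warn_fraction` is an arbitrary illustrative threshold, not an Explorer setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    gib = 1024 ** 3
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return {
        "total_gib": round(usage.total / gib, 1),
        "used_gib": round(usage.used / gib, 1),
        "free_gib": round(usage.free / gib, 1),
        "used_fraction": round(used_fraction, 3),
        "near_limit": used_fraction >= warn_fraction,
    }
```

Calling disk_usage_summary() on an Explorer instance reports usage for /home/explorer; a near_limit value of True is a hint to move large files to cloud storage.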

Cloud Storage, EOS

Data files commonly used in genomics applications are often very large (tens or even hundreds of GB), and therefore can be unwieldy to work with.

Your Gencove Explorer instance and Analysis jobs are all configured with access to EOS (Explorer Object Storage) - private cloud object storage which can only be accessed by users in the Gencove Organization you belong to.

EOS is essentially a lightweight wrapper around AWS S3 object storage that aims to simplify access to the appropriate user, organization, and Gencove-wide locations on S3.

In addition to its primary storage capabilities, EOS also provides archive and restore functionalities. This feature allows users to manage their data lifecycle by archiving infrequently accessed data and restoring it when needed. This ensures optimal storage utilization and cost efficiency.

EOS locations

EOS URIs always start with e://. There are three top-level namespaces for EOS:

  1. User: e://users/<user-id>/ (and e://users/me/ as shorthand for the current user)
    • read/write for current user, only read for the rest of your organization
  2. Organization: e://org/
    • read/write for entire organization
  3. Gencove: e://gencove/
    • read for entire organization
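The namespace layout above can be expressed as a small lookup. This is an illustrative sketch only: eos_namespace is a hypothetical helper written for this page, not an SDK function, and it simply inspects the first path component of the URI.

```python
# Access rules as listed above, keyed by namespace label.
EOS_ACCESS = {
    "user": "read/write for the owning user, read-only for the rest of the organization",
    "org": "read/write for the entire organization",
    "gencove": "read-only for the entire organization",
}

# Maps the first path component of an EOS URI to a namespace label.
_TOP_LEVEL = {"users": "user", "org": "org", "gencove": "gencove"}


def eos_namespace(uri: str) -> str:
    """Return 'user', 'org', or 'gencove' for an EOS URI (hypothetical helper)."""
    prefix = "e://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an EOS URI: {uri!r}")
    top = uri[len(prefix):].split("/", 1)[0]
    if top not in _TOP_LEVEL:
        raise ValueError(f"unknown EOS namespace in {uri!r}")
    return _TOP_LEVEL[top]
```

For example, EOS_ACCESS[eos_namespace("e://org/shared/ref.fa")] returns the access rule for the organization namespace.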

EOS can be accessed using the Gencove Explorer CLI and Explorer SDK.

EOS via CLI

The Gencove CLI provides AWS CLI equivalents of ls, cp, rm, and sync commands that can be used for Explorer data management. Note that the AWS CLI is a dependency for this functionality. For example:

gencove explorer data cp local_file.txt e://users/me/project-1/file.txt

For ease of use, the gencove explorer data command is also available via the ged alias. The equivalent ged command to the example above would be:

ged cp local_file.txt e://users/me/project-1/file.txt

List, ls

Listing files can be accomplished with the ls command:

ged ls
ged ls e://users/me/project-1/

Copy, cp

Files can be uploaded and downloaded using the cp command:

ged cp local_file.txt e://users/me/project-1/file.txt
ged cp e://users/me/project-1/file.txt local_file.txt
ged cp e://users/me/project-1/file.txt e://users/me/project-2/file.txt

Synchronize, sync

Files can be synced in bulk using the sync command:

ged sync /home/explorer/project-1/ e://users/me/project-1/
ged sync e://users/me/project-1/ /home/explorer/project-1/
ged sync e://users/me/project-1/ e://users/me/project-2/

Delete, rm

Remote files can be deleted using the rm command:

ged rm e://users/me/project-1/file.txt

Archive, archive

Archive files in bulk using the archive command:

ged archive e://users/me/project-1/

Archive files individually:

ged archive e://users/me/project-1/impute.vcf.gz

NOTE: Archived files reduce storage costs, but need to be restored before accessing them.

Restore, restore

Restore archived files in bulk using the restore command:

ged restore e://users/me/project-1/

Restore files individually:

ged restore e://users/me/project-1/impute.vcf.gz
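When these commands need to be scripted from Python rather than typed in a terminal, one option is to build the argument list and hand it to subprocess. The sketch below is an assumption-laden illustration: explorer_data_argv and run_explorer_data are hypothetical helper names, and the full gencove explorer data form is used in case ged is defined as a shell alias that subprocess cannot resolve.

```python
import subprocess

# Subcommands described in this section.
SUBCOMMANDS = {"ls", "cp", "rm", "sync", "archive", "restore"}


def explorer_data_argv(subcommand, *args):
    """Build the argument list for a `gencove explorer data` command."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    return ["gencove", "explorer", "data", subcommand, *args]


def run_explorer_data(subcommand, *args):
    """Run the command; raises subprocess.CalledProcessError on a non-zero exit."""
    return subprocess.run(explorer_data_argv(subcommand, *args), check=True)
```

For example, run_explorer_data("sync", "/home/explorer/project-1/", "e://users/me/project-1/") would run the same sync shown earlier, provided the Gencove CLI is installed and on the PATH.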

EOS via SDK File object

File object

The Explorer SDK comes with an inbuilt abstraction of a File object, which represents a file with a local and/or remote location. It provides a way to specify and transfer (download and upload) files between local and remote storage.

Neither the local nor the remote location needs to exist when the object is created, which effectively allows working with files in a "lazy" manner. This way, the abstraction of a file can be worked with, but the actual upload or download of a potentially very large file does not happen until a method is explicitly called.

Neither the local nor the remote location needs to be specified for the File object. If left unspecified, the following defaults are assumed:

  1. Local: a random filename in the /tmp directory
  2. Remote: a random key in a temporary EOS location

In this section, we describe the File object at a high level, and describe the various methods available to access and transfer (download and upload) remote files.

💡 Note that “local” in this section refers to your Explorer Instance storage or Explorer Analysis job storage.

Parameters

Parameter remote

The remote parameter allows you to specify the remote location of the file. The remote parameter supports a number of object types, including:

  • S3File
  • EFile
  • URLFile
  • NamedFile
  • Plain strings

When a string is supplied to remote, the SDK makes a best-effort attempt to identify the proper object type for it. For example:

from gencove_explorer.file import File

f = File(remote="e://users/me/example_e.txt")
print(f"{f.remote=}")  # f.remote=EFile(path='e://users/me/example_e.txt')

The above demonstrates how the e:// prefix is used to infer the EFile object type. Likewise, similar logic applies to s3:// and https:// prefixes, which will create S3File and URLFile object types, respectively.

In addition, supported objects can be passed in directly. For example:

from gencove_explorer.file import File, EFile

f = File(remote=EFile(path="e://users/me/example_e.txt"))

The NamedFile type provides a shorthand for EOS storage and behaves similarly to an EFile. When none of the known protocols are matched (e.g. https://, e://, s3://) in the remote string, the system will assume a NamedFile. For example, the following two File objects are equivalent:

from gencove_explorer.file import File

f1 = File(remote="e://users/me/example_e.txt")  # <- produces EFile
f2 = File(remote="example_e.txt")               # <- produces NamedFile
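The prefix rules described in this section can be summarized as a simple dispatch. The function below is a hypothetical illustration of those rules only; it is not the SDK's implementation, and it covers just the three prefixes named in this document.

```python
def infer_remote_kind(remote: str) -> str:
    """Return the name of the remote type this document associates with `remote`.

    Illustrative only: the real SDK may recognize additional forms.
    """
    if remote.startswith("e://"):
        return "EFile"
    if remote.startswith("s3://"):
        return "S3File"
    if remote.startswith("https://"):
        return "URLFile"
    # No known protocol prefix matched: treated as a name on EOS.
    return "NamedFile"
```

Under these rules, "e://users/me/example_e.txt" maps to EFile, while the bare string "example_e.txt" falls through to NamedFile.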

Parameter local

The local parameter allows you to specify the local location of a file. The parameter represents either an existing file, or a desired location for a file to be downloaded to, depending on context.

The example below demonstrates two uses of the local parameter:

  1. Uploading a local file to EOS
  2. Downloading a remote file to a local location

from gencove_explorer.file import File
from pathlib import Path

# Create a dummy file to upload
upload_demo_file = Path("/tmp/demo.txt")
upload_demo_file.write_text("example")

# Upload local file to remote destination on EOS
f1 = File(local=upload_demo_file, remote="upload_demo/upload_demo_file.txt")
f1.upload()

# Specify local path when creating the File object
f2 = File(remote="upload_demo/upload_demo_file.txt", local="/tmp/example_download.txt")
f2.download()

# Alternatively, specify the local path when calling .download()
f3 = File(remote="upload_demo/upload_demo_file.txt")
f3.download(local="/tmp/example_download_2.txt")

# Print path to local copy of files
print(f2.local)
print(f3.local)

Additional information on the download() and upload() methods can be found in the following sections.

💡 All parameters of the File object must be provided by keyword; otherwise, the exception "File.__init__() takes 1 positional argument but 2 were given" will be raised.

Downloading files

Once a File object that refers to a remote file is created, you can retrieve a local copy via the download() method. For example:

from gencove_explorer.file import File

# Copy file from EOS via remote
f1 = File(remote="e://users/me/example_e.txt")
f1.download()

# Copy file from EOS via name
f2 = File(remote="example_e.txt")
f2.download()

# Print path to local copies of files
print(f1.local)
print(f2.local)

Other remote objects

For files that exist remotely outside of EOS, you can use the remote parameter of File. The File object supports a number of sources, including:

  • S3 paths
  • URLs

Downloading an S3 object

from gencove_explorer.file import File

f3 = File(remote="s3://bucket/path/file.txt")

# Copy file from S3 (you must have the necessary IAM permissions)
f3.download()

# Print path to local copy of file
print(f3.local)

Downloading a file from a URL

For files that exist on the public Internet, you can provide the URL via the remote parameter to File:

from gencove_explorer.file import File

f4 = File(remote="https://...")

# Copy file from URL (URL must be publicly accessible)
f4.download()

# Print path to local copy of file
print(f4.local)

💡 Note that downloading files from FTP links via File.download() is not currently supported.

Uploading files

Files can also be uploaded from local storage to EOS (or S3 more generally):

from gencove_explorer.file import File

f1 = File(
    local="~/file.txt",
    remote="e://users/me/project-1/file.txt"
)
f1.upload()

Executing files

File objects can be executed as scripts by using the execute() method:

from gencove_explorer.file import File

f = File(local="~/script.sh")
f.execute()

Command-line parameters can be passed to execute() as follows:

from gencove_explorer.file import File

f = File(local="~/script.sh")
f.execute("parameter1", "parameter2")  # equivalent to: ~/script.sh parameter1 parameter2

In case it is preferable to process the output instead of printing it to the terminal, execute() can be configured to provide the output via its return value by setting capture_output to True:

from gencove_explorer.file import File

f = File(local="~/script.sh")
c = f.execute(capture_output=True)

print(c.stdout) # Standard output
print(c.stderr) # Standard error
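The stdout and stderr attributes follow the same pattern as Python's standard subprocess module. The sketch below shows that pattern with the standard library alone; it is an analogy under the assumption that execute() behaves similarly, since this document does not state how execute() is implemented. The helper name run_script_captured is hypothetical.

```python
import os
import subprocess
import tempfile


def run_script_captured(script_text, *args):
    """Write `script_text` to a temporary file, run it with /bin/sh, capture output."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "script.sh")
        with open(script, "w") as fh:
            fh.write(script_text)
        # capture_output=True collects stdout and stderr instead of printing them;
        # text=True decodes them to str.
        return subprocess.run(
            ["/bin/sh", script, *args],
            capture_output=True,
            text=True,
            check=True,
        )


result = run_script_captured('echo "hello $1 $2"\necho "warning" >&2\n', "parameter1", "parameter2")
print(result.stdout)  # text written to standard output
print(result.stderr)  # text written to standard error
```

Capturing the output this way makes it possible to parse or log results programmatically instead of reading them from the terminal.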

File.execute() uses /bin/sh as the default interpreter, but any interpreter can be specified by setting the appropriate shebang line in the file. For example, to use Python as the interpreter add the following line to the top of the Python script:

#!/usr/bin/env python

Finally, a real-world example for executing a shell script (~/script.sh) on an array of inputs utilizing the features described in this section:

from gencove_explorer.file import File
from gencove_explorer.analysis import Analysis, InputShared

a = Analysis(
    input=["a","b","c"],
    input_shared=InputShared(
        script=File(local="~/script.sh").upload()
    ),
    function=lambda ac: ac.input_shared.script.execute(ac.input),
).run()

Temporary files

It is also possible to use the File object for creating temporary local and/or remote files.

💡 Temporary files have an automatically generated filename that is guaranteed to be unique. The user should not expect that these files are permanently stored. We recommend using them for intermediate analysis results, while named files should be used for inputs and outputs.

To create a temporary remote file, exclude the name and remote parameters:

from gencove_explorer.file import File

# Create File object with no explicit destination
f = File(local="~/file.txt")

# Upload to temporary location
f.upload()

To create a temporary local file, exclude the local parameter:

from gencove_explorer.file import File

# Create File object with an explicit remote destination and a temporary local path
f = File(remote="e://users/me/project-1/file.txt")

!echo 123 > {f.local}

# Upload the temporary local file to the explicit remote destination
f.upload()

To create a file that is temporary both locally and remotely, exclude all parameters:

from gencove_explorer.file import File

# Create temporary File object
f = File()

!echo 123 > {f.local}

# Upload to temporary location
f.upload()

# Get destination
print(f.remote)

Generating URLs for remote File objects

It is possible to generate a temporarily accessible URL for a file in EOS via the url property of the File object's remote attribute (f.remote.url).

💡 The URLs generated with this method:

  1. Provide access to the file over the public Internet by anyone who has the URL
  2. Expire after 48 hours
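When passing such a URL to a collaborator or a downstream job, it can help to record when it stops working. The sketch below is plain date arithmetic with the standard library; it assumes the 48-hour window is counted from the moment the URL is generated, and url_expiry and is_url_expired are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

URL_LIFETIME = timedelta(hours=48)  # lifetime stated in this section


def url_expiry(generated_at: datetime) -> datetime:
    """Return the time at which a URL generated at `generated_at` expires."""
    return generated_at + URL_LIFETIME


def is_url_expired(generated_at: datetime, now: datetime = None) -> bool:
    """Return True once the 48-hour window has elapsed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= url_expiry(generated_at)
```

For example, a URL generated at 12:00 UTC on 1 January would be treated as expired from 12:00 UTC on 3 January.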

Below is an end-to-end example where we copy a local file to EOS, then obtain a URL for it.

from gencove_explorer.file import File

# Create File object with no explicit destination
f = File(local="~/file.txt")

# Write contents to file
!echo "URL example" > {f.local}

# Upload to temporary location
f.upload()

# Generate and print temporary URL for file
print(f.remote.url)

Sharing files within your organization

Files can be easily shared within your organization by sharing the EOS URI (e://...) of objects.

💡 Files created in your user namespace (e://users/me/) need to be made "shareable" by replacing the me shorthand with your user id. To simplify this process, the Explorer SDK provides a convenient File object attribute named path_e_shareable.

from gencove_explorer.file import File

# Create File object in user namespace
f = File(local="~/file.txt", remote="e://users/me/project-1/file.txt")

# Upload to the specified location on EOS
f.upload()

print(f.path_e_shareable)

EOS URIs of files created in the organization namespace (e://org/) can be shared within your organization without modification.
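The substitution that makes a user-namespace URI shareable can be sketched as a plain string rewrite. The function below is a hypothetical illustration of the rule described in this section, not the SDK's path_e_shareable implementation, and the user id in the usage example is a made-up placeholder.

```python
def to_shareable(uri: str, user_id: str) -> str:
    """Replace the `me` shorthand with an explicit user id; leave other URIs unchanged."""
    me_prefix = "e://users/me/"
    if uri.startswith(me_prefix):
        return f"e://users/{user_id}/" + uri[len(me_prefix):]
    # URIs in e://org/ (or already carrying a user id) need no modification.
    return uri
```

For example, to_shareable("e://users/me/project-1/file.txt", "example-user-id") yields "e://users/example-user-id/project-1/file.txt", while an e://org/ URI is returned as-is.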