Self-host data on the cloud

If your data is highly sensitive or your organization has data-sharing restrictions, you can host your data yourself, and we can set up permissions for read-only access. In this process, your files aren't copied onto our file system; we simply store the cloud locations of your files in our database.

❗️

Canceling or Moving Self-Hosted Data

When self-hosted data is canceled or moved, the associated data asset is removed from labeling circulation and no longer appears in the Centaur portal.

Creating a self-hosted project

📘

Organization Settings

If your organization typically requires self-hosting, you can contact your project manager and ask them to set self-hosting as the default for your organization. After this setting is updated, new projects will automatically be created as self-hosted projects unless otherwise specified.

You can create a self-hosted project using the API, or in the portal by checking Self-host my cloud data in the Create Project modal. After creating the project, you may import data from your S3 bucket, Azure container, or GCS bucket. Note that no files will be copied over to Centaur's file system.

Self-hosting with S3

❗️

KMS Permissions

If your data is server-side encrypted with KMS (SSE-KMS), you will also need to grant the kms:Decrypt permission to the Centaur import role. The Centaur team will need your AWS Account ID to enable this feature.

Each of our clients (organizations) has a dedicated AWS IAM role in our system that is used solely for data sharing between Centaur and that client. You can find details about this role by selecting your name at the bottom of the left-hand sidebar, then selecting Settings, and then Amazon S3 Integration.

Setting up the bucket

To enable self-hosted data sharing, you can give Centaur permission to access your S3 bucket either by granting read access to a “Centaur role” or by creating your own role and granting us permission to assume it. These roles are given permission to create pre-signed URLs.

  • Centaur role: You can apply a permissions policy to your bucket granting read access to the role Centaur has created for your organization.
  • Customer role: You can create your own role, and give our role permission to assume that role. You can also optionally specify a secret ID that Centaur must use while assuming the role.
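For the Centaur-role option, a bucket policy along the lines of the sketch below would grant read-only access. This is an illustration only: the role ARN and bucket name are placeholders, and the real role ARN is the one shown in your organization's Amazon S3 Integration settings.

```python
import json

# Hypothetical bucket policy granting read-only access to the role
# Centaur created for your organization. Both the role ARN and the
# bucket name below are placeholders, not real values.
CENTAUR_ROLE_ARN = "arn:aws:iam::CENTAUR_ACCOUNT_ID:role/your-org-role"  # placeholder
BUCKET = "your-bucket"  # placeholder

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCentaurRead",
            "Effect": "Allow",
            "Principal": {"AWS": CENTAUR_ROLE_ARN},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",    # s3:ListBucket applies to the bucket itself
                f"arn:aws:s3:::{BUCKET}/*",  # s3:GetObject applies to the objects
            ],
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

For the customer-role option, the trust policy on your own role would instead name Centaur's role as a principal allowed to call sts:AssumeRole; your project manager can confirm the exact details.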

You will also need to set the CORS configuration for the bucket to allow the content to be displayed on the Centaur portal and DiagnosUs app.

You may use the following configuration to allow Centaur to access your data on S3:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "HEAD",
            "GET"
        ],
        "AllowedOrigins": [
            "https://go.centaurlabs.com",
            "https://beta.centaurlabs.com"
        ],
        "ExposeHeaders": []
    }
]

For information on how to set the CORS rules for your bucket, please follow the AWS documentation here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/enabling-cors-examples.html
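If you prefer the AWS CLI over the console, the rules above can be saved to a file and applied with `aws s3api put-bucket-cors`. The sketch below only writes the file; the bucket name in the final comment is a placeholder.

```python
import json

# The CORS rules from this page, verbatim.
cors_rules = [
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["HEAD", "GET"],
        "AllowedOrigins": [
            "https://go.centaurlabs.com",
            "https://beta.centaurlabs.com",
        ],
        "ExposeHeaders": [],
    }
]

# put-bucket-cors expects the rules wrapped in a top-level "CORSRules" key.
with open("cors.json", "w") as f:
    json.dump({"CORSRules": cors_rules}, f, indent=2)

# Then apply it (placeholder bucket name):
#   aws s3api put-bucket-cors --bucket your-bucket --cors-configuration file://cors.json
```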

If you need a more restrictive configuration, please reach out to your project manager to discuss your options.

If server-side encryption is used, please reach out to your project manager, as additional manual configuration is required prior to importing data.
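For SSE-KMS specifically, the KMS Permissions note above means your key policy must let the Centaur import role call kms:Decrypt. A statement like the sketch below illustrates the shape; the role ARN is a placeholder, and the real ARN comes from the Centaur team.

```python
import json

# Hypothetical KMS key policy statement granting decrypt access to the
# Centaur import role. The role ARN below is a placeholder, not a real value.
kms_statement = {
    "Sid": "AllowCentaurImportDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::CENTAUR_ACCOUNT_ID:role/centaur-import-role"},
    "Action": "kms:Decrypt",
    "Resource": "*",  # inside a key policy, "*" refers to this key only
}

print(json.dumps(kms_statement, indent=2))
```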

Self-hosting with Azure

Set up the Azure integration

For each Azure tenant you wish to share data from, a one-time configuration is required. To complete it, log into the Centaur portal and select Organization Settings at the bottom of the left-hand sidebar. Select Azure, then Connect to Azure. When you click the Connect to Azure button, you will be prompted to sign into your Azure account and authorize the integration with Centaur Labs.

Set up the Storage Account permissions

❗️

Granting Permissions

To ensure correct access, the Storage Blob Delegator role must be granted at the storage account level, rather than at the container level.

For each Azure storage account from which you wish to share data, you will need to allow access by the centaurlabs-ingest application. In the Azure console, visit the page for your storage account. On the "Access Control (IAM)" page, navigate to the "Role assignments" tab and add a role assignment for the centaurlabs-ingest application. The required role assignments are Storage Blob Data Reader and Storage Blob Delegator.

If the centaurlabs-ingest application is not available as a member when adding role assignments, verify that the "Set up the Azure integration" step above was properly completed. You can view your existing Azure integrations at any time in the Centaur portal.

The CORS settings for the storage account will also need to be updated to allow Centaur to serve resources from your account. In the Azure console, navigate to "Resource sharing (CORS)" for your storage account. Under the "Blob Service" tab, add entries for https://go.centaurlabs.com and https://beta.centaurlabs.com allowing the HEAD and GET operations.

Self-hosting with GCP

This guide will walk you through importing a dataset stored in Google Cloud Storage (GCS) to your Centaur Labs project for labeling.

Collect Your Assets in a Google Cloud Storage Bucket

Ensure that your dataset is stored in a Google Cloud Storage (GCS) bucket. If you don’t have a GCS bucket, follow Google’s guide to create one.

Add the Centaur Service Account to Your Bucket

Before importing, you need to grant Centaur Labs access to your GCS bucket:

  1. Navigate to Google Cloud Console.
  2. Select Cloud Storage and locate your bucket.
  3. Go to the Permissions tab and click + Add.
  4. Follow the Centaur Data Imports instructions below to complete the permissions setup for your GCS import.

Centaur Data Imports Flow

  1. In the Centaur Portal, go to the Data Imports tab, click Add Data, and select Google Cloud.
  2. Enter your absolute file path, using the full gs:// path to your dataset in the import flow. For example:
    gs://my-bucket/test-images/
    Instructions will then appear to guide you through setting the necessary permissions for your GCS import.
  3. Click Verify GCS access to verify the configured permissions, and you're all set to import your data from GCS.
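As a quick sanity check before pasting a path into the portal (this helper is not part of the Centaur tooling), you can verify that a dataset location is a well-formed gs://bucket/prefix URI:

```python
from urllib.parse import urlparse

def split_gcs_path(path: str) -> tuple[str, str]:
    """Return (bucket, prefix) for a gs:// URI, or raise ValueError.

    Not part of Centaur tooling; just a local sanity check.
    """
    parsed = urlparse(path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// path: {path!r}")
    return parsed.netloc, parsed.path.lstrip("/")

# The example path from this guide:
bucket, prefix = split_gcs_path("gs://my-bucket/test-images/")
print(bucket, prefix)
```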

Use the API for Specific File Imports (Optional)

If you need to import specific files instead of the entire folder, use the source_files parameter in the API call. For details, refer to the updated API documentation.

Your dataset is now successfully imported and ready for labeling!

Data Ingest

Once you have followed the steps above to create a self-hosted project and configure your S3, Azure, or GCS data source, you can proceed with import as normal using the Centaur Portal or API.

ℹ️

Specific File Import

Specific files can be marked for importation using the source_files parameter in the API call. For more details, refer to the updated API documentation.
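To illustrate what a restricted import might carry, the sketch below builds a request body listing specific files. Only the source_files parameter name comes from this page; the surrounding payload shape and the file paths are illustrative assumptions, so consult the API documentation for the real request format.

```python
import json

# Sketch of a request body for importing only specific files.
# "source_files" is the documented parameter name; everything else here
# (payload shape, paths) is a placeholder assumption.
payload = {
    "source_files": [
        "s3://your-bucket/images/scan_001.png",  # placeholder paths
        "s3://your-bucket/images/scan_002.png",
    ],
}

print(json.dumps(payload, indent=2))
```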

Specific Instructions for Text Data

For each text snippet, create a single text file consisting of only that snippet. Add these text files to the S3 path you will later import. It is important to include only one snippet per text file, as the entire contents of a text file are always treated as a single snippet.
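As an example of preparing text data this way, the snippet below writes each text snippet to its own file in a local staging folder, which you would then upload to the S3 path you plan to import. The snippets, file names, and folder name are all arbitrary examples.

```python
from pathlib import Path

# Example snippets; in practice these come from your own dataset.
snippets = [
    "Patient reports intermittent chest pain.",
    "No acute findings on the radiograph.",
]

out_dir = Path("text_upload")  # local staging folder; name is arbitrary
out_dir.mkdir(exist_ok=True)

# One snippet per file: the entire file is treated as a single snippet.
for i, snippet in enumerate(snippets):
    (out_dir / f"snippet_{i:04d}.txt").write_text(snippet, encoding="utf-8")

# Then sync the folder to the S3 path you will import, e.g.:
#   aws s3 sync text_upload/ s3://your-bucket/text-snippets/
```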

Specific Instructions for Video Data

For video imports, the supported video codec is h264 and the supported audio codec is aac. Files must adhere to these specifications to prevent validation failures during upload.
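One way to check codecs before importing is with ffmpeg's ffprobe tool (for example, `ffprobe -v quiet -print_format json -show_streams input.mp4`). The helper below only inspects ffprobe-style JSON that you supply, so it runs without ffmpeg installed; the sample probe data is made up for illustration.

```python
# Supported codecs per the note above.
SUPPORTED = {"video": "h264", "audio": "aac"}

def unsupported_streams(probe: dict) -> list[str]:
    """Describe streams whose codec is not supported.

    `probe` is the parsed JSON from:
        ffprobe -v quiet -print_format json -show_streams input.mp4
    """
    problems = []
    for stream in probe.get("streams", []):
        kind = stream.get("codec_type")
        codec = stream.get("codec_name")
        if kind in SUPPORTED and codec != SUPPORTED[kind]:
            problems.append(f"{kind} stream uses {codec}, expected {SUPPORTED[kind]}")
    return problems

# Made-up ffprobe-style output with an unsupported audio codec:
probe = {
    "streams": [
        {"codec_type": "video", "codec_name": "h264"},
        {"codec_type": "audio", "codec_name": "mp3"},
    ]
}
print(unsupported_streams(probe))  # → ['audio stream uses mp3, expected aac']
```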

Stored Metadata

We store the following metadata in our system for self-hosted data:

  • File locations
  • Asset dimensions
  • Asset duration
  • File type / extension