GCS

For customers who export data into Google Cloud Storage (GCS), such as BigQuery exports, application logs, or custom event streams, JustAI can sync that data into JustAI’s S3 bucket using AWS DataSync.

You’ll need:

  • GCP

    • A GCS bucket where your daily export lands, e.g. gs://<your-gcs-bucket>/exports/
    • JustAI will provision a service account, and you will need to grant read access to your bucket.
  • AWS

    • JustAI will create an HMAC key for the JustAI service account and set up an AWS DataSync task.
  • Data format (recommended)

    • Partitioned by time (daily or hourly).
    • Columnar or compressed format (Parquet, gzipped JSON, etc.).
    • A predictable directory layout, e.g.: s3://justwords-metrics-ingest/<org_slug>/gcs/events/dt=2025-01-01/...

2) GCP: Create / verify the service account

  1. Create (or reuse) a service account
    • In the GCP Console → IAM & Admin → Service Accounts.
    • Give it a name like justai-datasync-reader.
  2. Grant it storage permissions on the GCS bucket:
    • At minimum:
      • Storage Object Viewer on gs://<your-gcs-bucket>
  3. Verify the bucket path you’ll sync:
    • For example: gsutil ls -L gs://<your-gcs-bucket>/exports/
    • This path (the exports/ prefix) is what you’ll point DataSync to.
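
If you prefer the CLI, steps 1 and 2 above map roughly to the following gcloud commands (a sketch; the service-account name, project ID, and bucket are placeholders to adapt to your environment):

  # Create the reader service account (skip if reusing an existing one).
  gcloud iam service-accounts create justai-datasync-reader \
    --project=<project-id> \
    --display-name="JustAI DataSync reader"

  # Grant read-only access to the export bucket.
  gcloud storage buckets add-iam-policy-binding gs://<your-gcs-bucket> \
    --member="serviceAccount:justai-datasync-reader@<project-id>.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"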

3) GCP: Create an HMAC key for the service account

AWS DataSync talks to GCS via an S3-compatible (XML) API using HMAC credentials.

  1. Using the gcloud CLI: gcloud storage hmac create justai-datasync-reader@<project-id>.iam.gserviceaccount.com
  2. Record the output:
    • accessId (HMAC Access Key ID)
    • secret (HMAC Secret Key — shown once)
  3. Keep these somewhere secure; you’ll paste them into the DataSync GCS location.
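
One way to capture both values in a single command (a sketch; the --format field names assume the HmacKey resource shape returned by gcloud storage, so verify against your gcloud version):

  # Create the HMAC key and print accessId,secret as CSV.
  # The secret is shown only once; store it in a secrets manager immediately.
  gcloud storage hmac create \
    justai-datasync-reader@<project-id>.iam.gserviceaccount.com \
    --format="csv[no-heading](metadata.accessId,secret)"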

4) AWS: Prepare the destination S3 bucket / prefix

JustAI will read from a shared S3 bucket (either yours or a JustAI-provided bucket), typically under: s3://justwords-metrics-ingest/<org_slug>/...

  1. Decide on a prefix for GCS-backed data, e.g.: s3://justwords-metrics-ingest/<org_slug>/gcs/exports/

  2. Create / update the bucket policy (if needed) to allow:

    • The DataSync role to call s3:PutObject under that prefix, plus s3:ListBucket and s3:GetBucketLocation on the bucket.
    • (If JustAI needs to read it directly, we’ll share our AWS role ARN so you can grant us s3:GetObject.)
  3. Confirm you can write to the prefix (a simple sanity check with the AWS CLI or a test script; see the sketch below).
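
A minimal write check with the AWS CLI (a sketch; it assumes your local credentials already have write access to the agreed prefix, and the object key is just an example):

  # Upload, list, and remove a throwaway object under the destination prefix.
  echo "justai write check" > /tmp/justai_write_check.txt
  aws s3 cp /tmp/justai_write_check.txt s3://justwords-metrics-ingest/<org_slug>/gcs/exports/_write_check.txt
  aws s3 ls s3://justwords-metrics-ingest/<org_slug>/gcs/exports/
  aws s3 rm s3://justwords-metrics-ingest/<org_slug>/gcs/exports/_write_check.txt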

In the AWS console:

  1. Go to DataSync → Locations → Create location.
  2. Location type: Object storage.
  3. Configure:
    • Server: storage.googleapis.com
    • Bucket name: <your-gcs-bucket>
    • Folder (prefix): /exports/
      • This must match your GCS path. If your files live at
        gs://<your-gcs-bucket>/exports/, then /exports/ is correct.
    • Authentication:
      • Access Key: your HMAC accessId
      • Secret Key: your HMAC secret
  4. Save the GCS location.
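
The same source location can also be created with the AWS CLI (a sketch; values in angle brackets are placeholders):

  # Create the GCS source location via the S3-compatible endpoint.
  # --agent-arns points at the DataSync agent the transfer runs through, if your setup uses one.
  aws datasync create-location-object-storage \
    --server-hostname storage.googleapis.com \
    --bucket-name <your-gcs-bucket> \
    --subdirectory /exports/ \
    --access-key <hmac-access-id> \
    --secret-key <hmac-secret> \
    --agent-arns <datasync-agent-arn>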

Still in AWS DataSync:

  1. Create another location → Amazon S3.
  2. Configure:
    • Bucket: justwords-metrics-ingest
    • Subdirectory: /<org_slug>/gcs/exports/
      (use a clean, slash-separated prefix with no stray spaces)
    • S3 storage class: typically STANDARD (or as agreed).
    • IAM role: a role that DataSync assumes, with:
      • s3:ListBucket on the bucket.
      • s3:PutObject on the chosen prefix.
  3. Save the S3 location.
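
A minimal policy sketch for that DataSync IAM role, matching the permissions listed above (bucket and prefix are the examples used in this guide; depending on task options DataSync may need additional actions, e.g. multipart-upload permissions):

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "ListDestinationBucket",
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
        "Resource": "arn:aws:s3:::justwords-metrics-ingest"
      },
      {
        "Sid": "WriteUnderOrgPrefix",
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::justwords-metrics-ingest/<org_slug>/gcs/exports/*"
      }
    ]
  }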

Now create the DataSync task:

  1. Go to DataSync → Tasks → Create task.
  2. Source location: your GCS object storage location.
  3. Destination location: your S3 location.
  4. Task settings:
    • Transfer mode
      • Start with “Copy all data”, then switch to “Copy only data that has changed” for subsequent runs.
    • Object tags
      • Disable / do not copy object tags — GCS XML API does not support tags the way S3 does, and leaving this on will cause errors.
    • Filters (optional):
      • Include: exports/** (or narrower)
      • Exclude: temporary / partial files if your export tool uses them.
  5. Schedule (recommended):
    • Daily run after your GCS export completes, e.g.
      • GCS export finishes ~09:00 UTC → schedule DataSync at 09:30 UTC.
  6. Save the task.
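
For reference, an equivalent task definition with the AWS CLI might look like this (a sketch; location ARNs are placeholders and the cron expression matches the 09:30 UTC example above):

  # Create the task with object tags disabled and a daily 09:30 UTC schedule.
  # Use TransferMode=ALL for the initial full copy, then switch to CHANGED.
  aws datasync create-task \
    --name justai-gcs-to-s3 \
    --source-location-arn <gcs-location-arn> \
    --destination-location-arn <s3-location-arn> \
    --options TransferMode=CHANGED,ObjectTags=NONE \
    --schedule ScheduleExpression="cron(30 9 * * ? *)"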

Run the task manually once:

  1. Start task → Run once now.
  2. Monitor in DataSync:
    • Status should move from PREPARING → TRANSFERRING → SUCCESS.
  3. Confirm in S3: s3://justwords-metrics-ingest/<org_slug>/gcs/exports/...
    • Files and directory layout should match your GCS exports/ folder.
    • Spot-check a few files to ensure:
      • They’re readable (Parquet/JSON opens correctly).
      • Timestamps / partitions are correct (e.g. dt=YYYY-MM-DD).
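
A quick spot-check from the AWS CLI (adjust the prefix to whatever you configured):

  # List the most recently synced objects under the destination prefix.
  aws s3 ls s3://justwords-metrics-ingest/<org_slug>/gcs/exports/ --recursive --human-readable | tail -n 20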

To make downstream analytics and attribution easy, we recommend:

  • Partitioning

Daily or hourly partitions, e.g.: s3://justwords-metrics-ingest/<org_slug>/gcs/events/dt=2025-01-01/...

  • Typical event schema
    {
      "event_timestamp": 1736387200,
      "user_id": "abc123",
      "event_name": "click",
      "copy_id": "6b3f2dd3-1c57-4f56-bc26-89af7bb6cb30",
      "template_id": "welcome_email_1",
      "campaign_id": "Iterable/Braze/etc",
      "vars": {
        "subject": "Welcome to our Community",
        "preheader": "Let’s get started!",
        "body": "<html>…</html>"
      },
      "attrs": {
        "persona": "learner",
        "age": "18-24"
      }
    }
  • File format
    • Parquet (preferred), or JSON + gzip.
    • Reasonable file sizes (128–512 MB compressed) to avoid “many tiny files” issues.

If you already have a different schema in GCS, we can ingest that too — the main requirements are a timestamp, some kind of user or message ID, and stable partitioning.

Once the pipeline is running:

  1. Share with JustAI:
    • The S3 bucket + prefix you’re writing to.
    • The event schema (fields + types).
    • Expected cadence / SLA (e.g. daily by 10:00 UTC).
  2. We:
    • Wire the S3 prefix into our ingestion jobs.
    • Backfill a historical window (if available).
    • Confirm records appear in dashboards / experiments.
Troubleshooting common issues:

  • Task succeeds but 0 files transferred
    • Check the Folder/prefix in the GCS location (e.g. /exports/).
    • Verify new files exist since the last run.
    • Make sure include/exclude filters aren’t too aggressive.
  • Task fails with ObjectTags PRESERVE cannot be used
    • Ensure “copy object tags” is disabled in the task options.
  • Permission errors
    • GCP:
      • Confirm the service account has Storage Object Viewer on the bucket.
    • AWS:
      • Confirm the DataSync IAM role has s3:PutObject & s3:ListBucket on the target prefix.
  • Performance issues
    • Large backfills: consider one-off tasks per date range.
    • Many small files: consider compacting GCS output before syncing (e.g. via a daily batch job).
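
When a run fails or transfers zero files, the per-execution details usually show why (a sketch; ARNs are placeholders):

  # List recent executions for the task, then inspect one for errors and file/byte counts.
  aws datasync list-task-executions --task-arn <task-arn>
  aws datasync describe-task-execution --task-execution-arn <task-execution-arn>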