GCS
For customers who export data into Google Cloud Storage (GCS), such as BigQuery exports, application logs, or custom event streams, JustAI can sync that data into JustAI's S3 bucket using AWS DataSync.
1) Prerequisites

You'll need:

- GCP
  - A GCS bucket where your daily export lands, e.g. `gs://<your-gcs-bucket>/exports/`
  - JustAI will provision a service account, and you will need to grant read access to your bucket.
- AWS
  - JustAI will create an HMAC key for the JustAI service account and set up the DataSync transfer.
- Data format (recommended)
  - Partitioned by time (daily or hourly).
  - Columnar or compressed format (Parquet, JSON + gzip, etc.).
  - A predictable directory layout, e.g. `s3://justwords-metrics-ingest/<org_slug>/gcs/events/dt=2025-01-01/...`
2) GCP: Create / verify the service account

- Create (or reuse) a service account:
  - In the GCP Console → IAM & Admin → Service Accounts.
  - Give it a name like `justai-datasync-reader`.
- Grant it storage permissions on the GCS bucket:
  - At minimum: `Storage Object Viewer` on `gs://<your-gcs-bucket>`.
- Verify the bucket path you'll sync:
  - For example: `gsutil ls -L gs://<your-gcs-bucket>/exports/`
  - This path (the `exports/` prefix) is what you'll point DataSync to.
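If you prefer to script these steps, here's a minimal sketch using the gcloud and gsutil CLIs. The service account name, project ID, and bucket are the placeholders used above; adjust them to your environment.

```bash
# Create (or reuse) a read-only service account for DataSync
gcloud iam service-accounts create justai-datasync-reader \
  --project="<project-id>" \
  --display-name="JustAI DataSync reader"

# Grant it Storage Object Viewer on the export bucket
gsutil iam ch \
  "serviceAccount:justai-datasync-reader@<project-id>.iam.gserviceaccount.com:roles/storage.objectViewer" \
  gs://<your-gcs-bucket>

# Verify the export prefix that DataSync will read from
gsutil ls -L gs://<your-gcs-bucket>/exports/
```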
3) GCP: Create an HMAC key for the service account

AWS DataSync talks to GCS via an S3-compatible (XML) API using HMAC credentials.

- Using the `gcloud` CLI:
  `gcloud storage hmac create justai-datasync-reader@<project-id>.iam.gserviceaccount.com`
- Record the output:
  - `accessId` (HMAC Access Key ID)
  - `secret` (HMAC Secret Key — shown once)
- Keep these somewhere secure; you'll paste them into the DataSync GCS location.
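The secret is only displayed at creation time. If you later need to confirm which HMAC keys exist for the service account, a quick check along these lines (same account name as above) should work:

```bash
# List HMAC keys for the DataSync service account (the secret is not shown again)
gcloud storage hmac list \
  --service-account="justai-datasync-reader@<project-id>.iam.gserviceaccount.com"
```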
4) AWS: Prepare the destination S3 bucket / prefix

JustAI will read from a shared S3 bucket (either yours or a JustAI-provided bucket), typically under `s3://justwords-metrics-ingest/<org_slug>/...`

- Decide on a prefix for GCS-backed data, e.g. `s3://justwords-metrics-ingest/<org_slug>/gcs/exports/`
- Create / update the bucket policy (if needed) to allow:
  - The DataSync role to use `s3:PutObject`, `s3:ListBucket`, and `s3:GetBucketLocation` on the bucket and that prefix.
  - (If JustAI needs to read it directly, we'll share our AWS role ARN so you can grant us `s3:GetObject`.)
- Confirm you can write to the prefix (a simple sanity check with the AWS CLI or a test script, e.g. the sketch below).
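A minimal write sanity check with the AWS CLI, assuming the example bucket and prefix above (the `_write_test/` object name is just an illustration):

```bash
# Write, list, and remove a small test object under the agreed prefix
echo "datasync write test" > /tmp/justai-write-test.txt

aws s3 cp /tmp/justai-write-test.txt \
  "s3://justwords-metrics-ingest/<org_slug>/gcs/exports/_write_test/justai-write-test.txt"

aws s3 ls "s3://justwords-metrics-ingest/<org_slug>/gcs/exports/_write_test/"

aws s3 rm "s3://justwords-metrics-ingest/<org_slug>/gcs/exports/_write_test/justai-write-test.txt"
```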
5) AWS: Create the DataSync locations

5.1 GCS source location

In the AWS console:

- Go to DataSync → Locations → Create location.
- Location type: `Object storage`.
- Configure:
  - Server: `storage.googleapis.com`
  - Bucket name: `<your-gcs-bucket>`
  - Folder (prefix): `/exports/`
    - This must match your GCS path. If your files live at `gs://<your-gcs-bucket>/exports/`, then `/exports/` is correct.
  - Authentication:
    - Access Key: your HMAC `accessId`
    - Secret Key: your HMAC `secret`
- Save the GCS location.
5.2 S3 destination location

Still in AWS DataSync:

- Create another location → Amazon S3.
- Configure:
  - Bucket: `justwords-metrics-ingest`
  - Subdirectory: `/<org_slug>/gcs/exports/` (use a clean, slash-separated prefix — no trailing spaces)
  - S3 storage class: typically `STANDARD` (or as agreed).
  - IAM role: a role that DataSync assumes, with:
    - `s3:ListBucket` on the bucket.
    - `s3:PutObject` on the chosen prefix.
- Save the S3 location.
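As a sketch of what that role's permissions can look like, here's an inline policy attached with the AWS CLI. The role name `justai-datasync-s3-writer` is hypothetical, and the bucket/prefix are the examples from this guide.

```bash
# Grant the DataSync role the bucket- and prefix-level access listed above
# (role name is an example; bucket/prefix match the examples in this guide)
cat > /tmp/datasync-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BucketLevel",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::justwords-metrics-ingest"
    },
    {
      "Sid": "PrefixLevel",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::justwords-metrics-ingest/<org_slug>/gcs/exports/*"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name justai-datasync-s3-writer \
  --policy-name datasync-s3-write \
  --policy-document file:///tmp/datasync-s3-policy.json
```

The role's trust policy also needs to allow `datasync.amazonaws.com` to assume it.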
6) AWS: Create the DataSync task

- Go to DataSync → Tasks → Create task.
- Source location: your GCS object storage location.
- Destination location: your S3 location.
- Task settings:
  - Transfer mode
    - Start with “Copy all data”, then switch to “Copy only data that has changed” for subsequent runs.
  - Object tags
    - Disable / do not copy object tags — the GCS XML API does not support tags the way S3 does, and leaving this on will cause errors.
  - Filters (optional):
    - Include: `exports/**` (or narrower)
    - Exclude: temporary / partial files if your export tool uses them.
- Schedule (recommended):
  - Daily run after your GCS export completes, e.g. GCS export finishes ~09:00 UTC → schedule DataSync at 09:30 UTC.
- Save the task.
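If you manage this with the AWS CLI instead of the console, a sketch of the equivalent task creation (the task name and location ARNs are placeholders; the options mirror the settings above):

```bash
# Create the task with changed-only transfers, object tags disabled,
# an include filter on the exports prefix, and a daily 09:30 UTC schedule.
# TransferMode=CHANGED is the steady-state setting; the first run against
# an empty destination prefix will still copy everything it finds.
aws datasync create-task \
  --name "gcs-exports-to-s3" \
  --source-location-arn "<gcs-location-arn>" \
  --destination-location-arn "<s3-location-arn>" \
  --options TransferMode=CHANGED,ObjectTags=NONE \
  --includes FilterType=SIMPLE_PATTERN,Value="/exports/*" \
  --schedule ScheduleExpression="cron(30 9 * * ? *)"
```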
7) Validate the DataSync pipeline

Run the task manually once:

- Start task → Run once now.
- Monitor in DataSync:
  - Status should move from `PREPARING` → `TRANSFERRING` → `SUCCESS`.
- Confirm in S3:
  - `s3://justwords-metrics-ingest/<org_slug>/gcs/exports/...`
  - Files and directory layout should match your GCS `exports/` folder.
  - Spot-check a few files to ensure:
    - They're readable (Parquet/JSON opens correctly).
    - Timestamps / partitions are correct (e.g. `dt=YYYY-MM-DD`).
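The same run-and-check flow from the CLI, assuming the task ARN and the example prefix from earlier sections:

```bash
# Kick off a one-time run and capture the execution ARN
EXEC_ARN=$(aws datasync start-task-execution \
  --task-arn "<task-arn>" \
  --query TaskExecutionArn --output text)

# Poll the execution status (PREPARING -> TRANSFERRING -> SUCCESS)
aws datasync describe-task-execution \
  --task-execution-arn "$EXEC_ARN" \
  --query Status --output text

# Spot-check what landed in S3
aws s3 ls "s3://justwords-metrics-ingest/<org_slug>/gcs/exports/" --recursive | head
```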
8) Schema & partitioning recommendations
Section titled “8) Schema & partitioning recommendations”To make downstream analytics and attribution easy, we recommend:
- Partitioning
Daily or hourly partitions, e.g.: s3://justwords-metrics-ingest/<org_slug>/gcs/events/dt=2025-01-01/...
- Typical event schema
{ "event_timestamp": 1736387200, "user_id": "abc123", "event_name": "click", "copy_id": "6b3f2dd3-1c57-4f56-bc26-89af7bb6cb30", "template_id": "welcome_email_1", "campaign_id": "Iterable/Braze/etc", "vars": { "subject": "Welcome to our Community", "preheader": "Let’s get started!", "body": "<html>…</html>" }, "attrs": { "persona": "learner", "age": "18-24" }}- File format
- Parquet (preferred), or JSON + gzip.
- Reasonable file sizes (128–512 MB compressed) to avoid “many tiny files” issues.
If you already have a different schema in GCS, we can ingest that too — the main requirements are a timestamp, some kind of user or message ID, and stable partitioning.
9) Hand-off to JustAI

Once the pipeline is running:

- Share with JustAI:
  - The S3 bucket + prefix you're writing to.
  - The event schema (fields + types).
  - Expected cadence / SLA (e.g. daily by 10:00 UTC).
- We will:
  - Wire the S3 prefix into our ingestion jobs.
  - Backfill a historical window (if available).
  - Confirm records appear in dashboards / experiments.
10) Troubleshooting

- Task succeeds but 0 files transferred
  - Check the Folder / prefix in the GCS location (e.g. `/exports/`).
  - Verify new files exist since the last run.
  - Make sure include/exclude filters aren't too aggressive.
- Task fails with `ObjectTags PRESERVE cannot be used`
  - Ensure “copy object tags” is disabled in the task options.
- Permission errors
  - GCP:
    - Confirm the service account has `Storage Object Viewer` on the bucket.
  - AWS:
    - Confirm the DataSync IAM role has `s3:PutObject` and `s3:ListBucket` on the target prefix.
- Performance issues
  - Large backfills: consider one-off tasks per date range.
  - Many small files: consider compacting GCS output before syncing (e.g. via a daily batch job).
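For the “0 files transferred” case, a quick way to compare both sides (same example paths as above):

```bash
# Is there new data on the GCS side since the last run?
gsutil ls -r gs://<your-gcs-bucket>/exports/ | tail

# What has actually landed in S3 so far?
aws s3 ls "s3://justwords-metrics-ingest/<org_slug>/gcs/exports/" --recursive | tail
```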