How we built shared-pool OpenClaw hosting (and why per-tenant Railway didn't scale)

An engineering deep-dive on ShipClaw's pool-node architecture: 50 users per Railway container, atomic FOR UPDATE SKIP LOCKED assignment, port-stride routing, and the 50-volumes-per-project ceiling that forced multi-project sharding.

Aria Keshmiri

The first version of ShipClaw gave every user their own Railway service. One signup, one new service, one new volume, one new domain. It worked. It worked great, actually — for the first 40 users.

This post is the post-mortem on why we tore that out, and the architecture we replaced it with: shared pool nodes running ~50 users per Railway container, with atomic placement, port-stride routing, and a 1.5 GB memory watchdog. Most of this is visible in the codebase under src/lib/pool-manager.ts and openclaw-railway/server.js if you want to read the receipts.

Why per-tenant Railway broke

Three reasons, in increasing order of pain.

1. The 50-volumes-per-project ceiling

Railway caps a single project at 50 volumes. Each user got a volume for their OpenClaw workspace state. So we had a hard architectural ceiling at 50 paying users per Railway project.

You can spin up multiple Railway projects, but the Railway API is per-project — every project switch is a separate auth context, separate quotas, separate dashboards. We were going to have to write multi-project orchestration anyway.

2. Cold-start tax on idle agents

Each user's service had its own container baseline cost. An idle user — someone who deployed a Telegram bot and didn't message it for three days — was burning roughly the same compute as an active one. Multiply by N users and the unit economics of credit-based billing get genuinely bad.

The natural fix is "scale to zero," but Railway's scale-to-zero cold start at the time was 4–8 seconds. That's brutal for a Telegram agent where the user expects "hi" to get a reply in under two seconds.

3. Double-booking under burst signups

We're targeting an influencer-driven launch (March 2026, post-flush). The expected curve is "200 signups in three minutes" when a video lands. Per-tenant Railway provisioning was racy enough under load that we'd see the same Railway service double-attached to two users in our DB. Cleanable, but only after a human noticed.

What we replaced it with

Shared pool nodes. One Railway container hosts up to 50 users. Each user gets:

  • A dedicated OpenClaw process spawned on demand (per-user isolation within the container)
  • A reserved block of 5 ports (gateway, Anthropic proxy, browser, Chrome relay, webhook)
  • A reserved workspace directory under /data/users/{userId}/
  • An idle suspension after 15 minutes (1 hour for Telegram-connected agents)

The pool node's server.js is a reverse proxy plus process manager. Incoming requests prefixed with /{userId}/v1/* are routed to the user's gateway port. If the process isn't running, it spawns one. If a user's process exceeds 1.5 GB of memory, the watchdog kills it and the next request respawns it clean.
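For flavor, here's a minimal sketch of that spawn-and-route loop in TypeScript. The helper names, the running map, the openclaw command and its --port flag are placeholders of mine, not the actual server.js internals, and the proxying itself is elided:

// Sketch only: spawn-on-demand plus path-based routing, with illustrative helpers.
import http from "node:http";
import { spawn, type ChildProcess } from "node:child_process";

const BASE = 20000;
const STRIDE = 10;
const running = new Map<string, ChildProcess>();

// Stub: the real node gets the user -> slot mapping from the control plane.
const slotFor = (_userId: string): number => 0;

function ensureProcess(userId: string, port: number): void {
  if (running.has(userId)) return;
  const child = spawn("openclaw", ["--port", String(port)], {
    cwd: `/data/users/${userId}/`,
  });
  child.on("exit", () => running.delete(userId));
  running.set(userId, child);
}

http
  .createServer((req, res) => {
    const [, userId, ...rest] = (req.url ?? "/").split("/");
    const port = BASE + slotFor(userId) * STRIDE;
    ensureProcess(userId, port);
    // The real server.js proxies the request to 127.0.0.1:port; elided in this sketch.
    res.end(`(sketch) would proxy /${rest.join("/")} to 127.0.0.1:${port}`);
  })
  .listen(8080);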

Atomic placement: FOR UPDATE SKIP LOCKED

The signup-burst problem demanded an atomic assignment primitive. The pattern:

SELECT * FROM "PoolNode"
WHERE "assignedCount" < "capacity"
  AND status = 'ACTIVE'
ORDER BY "assignedCount" ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;

The FOR UPDATE SKIP LOCKED is the magic. If 200 transactions hit this query simultaneously, each one walks past locked rows and grabs the next available pool node. No double-booking, no deadlock waits, no application-side retry storm. Postgres earns its salary on this query.

After the row is selected, we increment assignedCount, write the user→node mapping, and commit. If commit fails (network hiccup), the lock is released and the next signup gets that slot.
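Sketched with a Prisma-style client, the whole placement transaction is short. The model and field names here (poolNode, assignedCount, user.poolNodeId) are illustrative assumptions, not a copy of pool-manager.ts:

// Illustrative placement transaction, not the actual pool-manager.ts.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function assignPoolNode(userId: string): Promise<string> {
  return prisma.$transaction(async (tx) => {
    // Concurrent signups skip rows another transaction has locked.
    const nodes = await tx.$queryRaw<{ id: string }[]>`
      SELECT id FROM "PoolNode"
      WHERE "assignedCount" < "capacity" AND status = 'ACTIVE'
      ORDER BY "assignedCount" ASC
      LIMIT 1
      FOR UPDATE SKIP LOCKED`;
    if (nodes.length === 0) throw new Error("no pool node with free capacity");

    const nodeId = nodes[0].id;
    await tx.poolNode.update({
      where: { id: nodeId },
      data: { assignedCount: { increment: 1 } },
    });
    // The user -> node mapping; the field name is illustrative.
    await tx.user.update({ where: { id: userId }, data: { poolNodeId: nodeId } });
    return nodeId;
  });
}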

We tested this against a 500-concurrent-signup synthetic load. Zero double-bookings, zero pool-node assignment errors, p95 placement latency under 80 ms. Good enough.

Multi-project sharding

The 50-volumes-per-project ceiling didn't go away: pool nodes still need volumes, just one per node instead of one per user. With 50 users per node, four projects of 50 nodes each get us to 10,000 users before we have to think harder. We've architected for four projects. At our current launch trajectory we will not need more than two of those any time soon.

The railway.ts library handles project selection by hashing the pool-node ID. The hash is deterministic, so a given pool node always lives in the same project, but the projects themselves are interchangeable.
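The selection itself is a few lines of arithmetic on a hash. A sketch of the idea, with placeholder project IDs rather than anything from the real railway.ts:

// Illustrative project selection by hashing the pool-node ID.
import { createHash } from "node:crypto";

const PROJECT_IDS = ["railway-project-a", "railway-project-b", "railway-project-c", "railway-project-d"]; // placeholders

function projectForPoolNode(poolNodeId: string): string {
  const digest = createHash("sha256").update(poolNodeId).digest();
  // Deterministic: the same pool node always lands in the same project.
  return PROJECT_IDS[digest.readUInt32BE(0) % PROJECT_IDS.length];
}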

Port stride: 5 per user

Each OpenClaw process needs five distinct local ports. Rather than picking five random free ports each spawn, we assigned each user a base port B and reserved B, B+1, B+2, B+3, B+4. The base ports go in increments of 10 to leave headroom.

This made the reverse proxy code trivial. The user's gateway is always at BASE + slot * 10. The Anthropic proxy is always +1. No port discovery, no service registry, no Consul. Just arithmetic.

Port stride math:

  • Pool node has 50 user slots
  • Each slot reserves 10 ports (5 used, 5 reserved for future tools)
  • Total port surface per pool node: 500 ports above the base

Linux ephemeral ports start at 32768. We base our pool nodes at 20000, leaving plenty of room.
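Put together, the per-slot port map is pure arithmetic. A sketch using the constants above (the field names are mine):

// Illustrative port arithmetic; the constants match the numbers above.
const POOL_BASE = 20000;
const SLOT_STRIDE = 10; // 5 ports used per slot, 5 held in reserve

function portsForSlot(slot: number) {
  const base = POOL_BASE + slot * SLOT_STRIDE;
  return {
    gateway: base,            // base + 0
    anthropicProxy: base + 1,
    browser: base + 2,
    chromeRelay: base + 3,
    webhook: base + 4,
  };
}

// Slot 49 (the last of 50) uses ports 20490 through 20494, comfortably below 32768.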

The idle timeout: why 15 minutes (and 1 hour for Telegram)

Idle suspension is the lever that makes credit-billing work. If a user's agent runs 24/7 burning baseline browser memory, the unit economics drift toward "we're paying to host their idle Chromium."

15 minutes is the default. Empirically, a non-Telegram agent that hasn't received a request in 15 minutes is unlikely to receive one in the next 15.

Telegram is the exception. Telegram messages arrive on the user's schedule, not the bot's. A Telegram-connected agent that suspends after 15 minutes will visibly cold-start every time the user comes back from lunch. So we extended Telegram's timeout to 1 hour. The cost difference is small; the felt-quality difference is significant.
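Mechanically, the suspension is just a periodic sweep over last-activity timestamps. A sketch, with the data structures stubbed out (names and shapes are my assumptions, not lifted from server.js):

// Illustrative idle sweep; timeout values are from the text above.
import type { ChildProcess } from "node:child_process";

const IDLE_MS = 15 * 60 * 1000;            // default: 15 minutes
const TELEGRAM_IDLE_MS = 60 * 60 * 1000;   // Telegram-connected agents: 1 hour

const running = new Map<string, ChildProcess>();
const lastActivity = new Map<string, number>();  // bumped on every proxied request
const telegramUsers = new Set<string>();         // users with a connected Telegram bot

setInterval(() => {
  for (const [userId, proc] of running) {
    const limit = telegramUsers.has(userId) ? TELEGRAM_IDLE_MS : IDLE_MS;
    if (Date.now() - (lastActivity.get(userId) ?? 0) > limit) {
      proc.kill();               // suspend; the next request respawns the process
      running.delete(userId);
    }
  }
}, 60_000);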

Memory watchdog: why 1.5 GB

OpenClaw + a stealth Chromium baseline is ~250 MB. A real browser session with a few tabs is ~600 MB. A misbehaving session — a memory leak in some site's JavaScript, an agent that keeps opening tabs without closing them — can climb to 2 GB+ and stay there.

We watchdog-kill at 1.5 GB. The next request respawns the process clean. From the user's perspective: a slightly slower response, no other visible effect. From the pool node's perspective: bounded blast radius. One user's runaway browser cannot starve the other 49.

This was the single highest-leverage hardening. Before the watchdog, one bad session could force a manual container restart, which logged out 50 users at once.
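The watchdog itself is a polling loop. Here's a sketch that reads VmRSS from /proc, which assumes a Linux container and per-process accounting; the real watchdog presumably accounts for the whole process tree, Chromium included:

// Illustrative memory watchdog, not the actual server.js implementation.
import { readFileSync } from "node:fs";
import type { ChildProcess } from "node:child_process";

const MEMORY_LIMIT_BYTES = 1.5 * 1024 * 1024 * 1024; // 1.5 GB
const running = new Map<string, ChildProcess>();

function rssBytes(pid: number): number {
  // /proc/<pid>/status contains a line like "VmRSS:   612340 kB"
  const status = readFileSync(`/proc/${pid}/status`, "utf8");
  const match = status.match(/VmRSS:\s+(\d+)\s+kB/);
  return match ? Number(match[1]) * 1024 : 0;
}

setInterval(() => {
  for (const [userId, proc] of running) {
    if (proc.pid !== undefined && rssBytes(proc.pid) > MEMORY_LIMIT_BYTES) {
      proc.kill("SIGKILL");      // only this user's process dies; the other 49 are untouched
      running.delete(userId);    // the next request respawns it clean
    }
  }
}, 30_000);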

Sync debounce: 10 seconds

When a user updates their agent's config (e.g. saves a new bot token), we have to push the change to the pool node. The naive flow is: dashboard save → API call → pool node receives → pool node respawns the user's process with new config.

In practice, users save twice in quick succession all the time. (Form save, immediately notice a typo, fix it, save again.) Without a debounce, that's two respawns 200 ms apart, and the second one races the first.

The fix: a 10-second debounce on the sync endpoint. The pool node coalesces back-to-back syncs into one respawn. That catches almost all of the human double-save patterns without disrupting the rare legitimate back-to-back changes.
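A per-user trailing debounce is all this takes. A sketch (function names are placeholders, not the real endpoint code):

// Illustrative per-user debounce on the sync endpoint.
const SYNC_DEBOUNCE_MS = 10_000;
const pendingSync = new Map<string, NodeJS.Timeout>();

function scheduleSync(userId: string, respawnWithNewConfig: (userId: string) => void): void {
  // A second save inside the window resets the timer, so both saves collapse into one respawn.
  const existing = pendingSync.get(userId);
  if (existing) clearTimeout(existing);
  pendingSync.set(userId, setTimeout(() => {
    pendingSync.delete(userId);
    respawnWithNewConfig(userId);
  }, SYNC_DEBOUNCE_MS));
}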

What we'd do differently

If I were starting again today: same architecture, but I'd build the multi-project hashing on day one instead of bolting it on after we hit the 50-volumes ceiling. The migration path was uglier than the design.

I would also write the synthetic 500-concurrent-signup test on day one. The atomic placement pattern is correct, but I didn't believe it was correct until I'd watched it succeed under simulated burst load.

Try it

If you want to use the architecture without rebuilding it, that's literally the ShipClaw product. If you want to read the source, the openclaw-railway directory in our repo has the pool-mode server, and src/lib/pool-manager.ts has the placement logic.

— Aria