Scaling from 100 to 100,000 Users: A Security & Performance Checklist

Every order-of-magnitude jump breaks something different. A checklist for the bottlenecks and security gaps that bite at 1k, 10k, and 100k users.

What worked at 100 users will quietly fail at 1,000. What survives 1,000 will fall over at 10,000 in a way that takes you a weekend to diagnose. By 100,000 you have moved into a different system, even if the code looks the same.

This is a checklist for the things that bite at each scale. We've seen all of them more than once. Most are cheap to fix early and expensive to fix late.

At 100 users

You're optimizing for speed of iteration, not scale. That's correct. But do three things now, because they're close to free:

  • Pick a real database from day one. SQLite for local, Postgres for production. Don't put off the migration — it's harder when you have data.
  • Hash passwords with bcrypt or argon2id. Not SHA-256, not MD5, not "I'll fix it later."
  • Put one alert on /login failures. That's it. One alert. The day you have an attacker, you'll be glad it's there.
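For the password bullet, here is a minimal sketch of salted, slow password hashing. One loud assumption: Python's standard library ships neither bcrypt nor argon2id, so this sketch swaps in scrypt (hashlib.scrypt) purely to stay dependency-free. The cost parameters and the stored-string format are illustrative choices, not recommendations from this post. In a real app you'd more likely reach for a maintained bcrypt or argon2id library.

```python
import base64
import hashlib
import hmac
import os

# Assumed cost parameters for illustration; tune them for your hardware.
SCRYPT_N = 2**14  # CPU/memory cost (about 16 MiB with r=8)
SCRYPT_R = 8      # block size
SCRYPT_P = 1      # parallelism


def _b64(raw: bytes) -> str:
    return base64.b64encode(raw).decode("ascii")


def hash_password(password: str) -> str:
    """Return a self-describing string: scrypt$n$r$p$salt$digest."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P
    )
    return f"scrypt${SCRYPT_N}${SCRYPT_R}${SCRYPT_P}${_b64(salt)}${_b64(digest)}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute with the stored parameters and compare in constant time."""
    scheme, n, r, p, salt_b64, digest_b64 = stored.split("$")
    if scheme != "scrypt":
        return False
    salt = base64.b64decode(salt_b64)
    expected = base64.b64decode(digest_b64)
    actual = hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=int(n),
        r=int(r),
        p=int(p),
        dklen=len(expected),
    )
    return hmac.compare_digest(actual, expected)
```

Storing the parameters next to the hash means you can raise the cost later and still verify old passwords.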

You don't need queues, microservices, caches, or a CDN. You don't need any of it. Resist.

At 1,000 users

This is where the first cracks show. Symptom: things start "feeling slow" but you can't reproduce it locally.

Things that break:

  • N+1 queries. Your ORM is happily issuing 80 queries to render a list page. Add query logging, find the worst offender, fix it. Repeat.
  • Synchronous email. Sending welcome emails inline in the request blocks the response. Move to a background job (BullMQ, Sidekiq, anything). One worker is enough.
  • Cold-start auth. Your session lookup hits the database on every request. Either cache it (Redis, even single-node) or move to JWT-with-rotation if you understand the trade-offs.
  • Logging that doesn't aggregate. Log to stdout, ship to Datadog/Logtail/Better Stack. You will need to grep across servers within the next month.
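The N+1 item is easiest to see with query logging switched on. This is a small sketch using an in-memory SQLite database. The authors/posts schema and the function names are hypothetical, made up for the illustration. It renders the same list page two ways and counts the SELECT statements each one issues.

```python
import sqlite3

# Hypothetical schema for illustration: 20 posts, each with one author.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany(
    "INSERT INTO authors VALUES (?, ?)", [(i, f"author-{i}") for i in range(1, 21)]
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)", [(i, i, f"post-{i}") for i in range(1, 21)]
)
conn.commit()

# Query logging: every executed statement is appended to this list.
statements = []
conn.set_trace_callback(statements.append)


def list_page_n_plus_one():
    """One query for the posts, then one more query per post for its author."""
    rows = conn.execute("SELECT id, author_id, title FROM posts ORDER BY id").fetchall()
    page = []
    for _post_id, author_id, title in rows:
        name = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()[0]
        page.append((title, name))
    return page


def list_page_joined():
    """The same page in a single query, using a JOIN."""
    return conn.execute(
        "SELECT posts.title, authors.name FROM posts "
        "JOIN authors ON authors.id = posts.author_id ORDER BY posts.id"
    ).fetchall()


def count_selects(render):
    """Run a render function and report how many SELECTs it issued."""
    statements.clear()
    result = render()
    selects = sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))
    return result, selects
```

With an ORM the fix usually has a different spelling (eager loading, prefetching, includes), but the shape is the same: one query per page instead of one query per row.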

Security additions at this stage:

  • Rate limit /login, /signup, /forgot-password (5 attempts per 15 minutes, per IP and per account)
  • Add CSRF tokens if you're not on a framework that handles them
  • Move secrets out of .env into a real secret manager (AWS Secrets Manager, Doppler, Infisical)
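The rate limit above can be sketched as a sliding window keyed by both IP and account. This is an in-memory, single-process sketch. The class and function names are hypothetical, and the injectable clock is there only to make it testable. Once you run more than one app server, the counters need to live somewhere shared.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within the last `window_seconds`."""

    def __init__(self, limit=5, window_seconds=15 * 60, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of recent hits

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def allow_login(limiter, ip, account):
    """Check both keys so one attacker IP and one targeted account are each capped."""
    # Evaluate both (no short-circuit) so the attempt counts against
    # every key that still has room.
    ip_ok = limiter.allow(f"ip:{ip}")
    account_ok = limiter.allow(f"account:{account}")
    return ip_ok and account_ok
```

The per-account key is what stops a distributed attack that rotates IPs against a single user. The per-IP key is what stops one host from spraying many accounts.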

At 10,000 users

This is where you stop being able to "just throw a bigger box at it." Real architecture decisions surface.

Performance:

  • Add a CDN. Static assets through CloudFront, Cloudflare, Fastly. This is the single highest-leverage performance change at this scale.
  • Add a read replica. Most of your traffic is reads. Send them to a replica, keep writes on the primary. Most ORMs handle this with one config change.
  • Add caching. Redis for session data, computed values, and rate limit counters. Cache anything that's expensive and changes less than once per minute.
  • Move long jobs out of HTTP. Anything that takes more than 500ms goes to a queue. Period.
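The caching rule of thumb ("expensive, and changes less than once per minute") maps onto a cache-aside helper with a 60-second TTL. This is a minimal in-process sketch. The class name and the injectable clock are assumptions for illustration. In production the store would typically be Redis rather than a dict, but the read path looks the same.

```python
import time


class TTLCache:
    """Cache-aside: return a fresh cached value, or compute and store it."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: still within its TTL
        value = compute()  # miss or expired: do the expensive work once
        self._entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Call this on writes when waiting out the TTL is not acceptable."""
        self._entries.pop(key, None)
```

Note that this version does nothing to stop many callers from recomputing the same expired key at once. That's fine at 10k and becomes a real problem later.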

Security at 10k:

  • Audit your dependency tree. Run npm audit --omit=dev weekly (--production is the older, deprecated spelling of the same flag), with a process to triage Highs within 72 hours
  • Lock down admin. MFA mandatory, IP allowlist if practical, separate admin domain from app domain
  • Pen-test the authenticated surface. This is the scale at which an outside read pays for itself — see /services/penetration-testing
  • Set up tabletop incident response. Spend 2 hours pretending you got breached. The first time you do this, you'll find five things you don't have answers for. Better to find them in a meeting than at 3am.

Operational:

  • Database backups, tested. Untested backups are not backups. Restore one to a staging environment monthly.
  • Runbooks for the top five incidents you can imagine. Not a wiki — a checklist someone can follow at 3am while half-asleep.

At 100,000 users

You are no longer a startup. The system is large enough that no single engineer holds it in their head. New failure modes:

  • Hot keys in your database. One row gets read a million times a day. Cache aggressively, or denormalize.
  • Connection pool exhaustion. Your app servers have collectively opened more connections than the database can handle. Add a connection pooler (PgBouncer for Postgres) before this is a fire.
  • Background job pile-up. Your queue is 10 minutes behind. You need either more workers, smarter prioritization, or both.
  • Cache stampedes. A popular entry expires, every in-flight request misses the cache at the same moment, and the recomputation lights the database on fire. Implement single-flight or probabilistic early expiration.
  • Tenant noise. One customer's traffic pattern destroys performance for everyone. You need per-tenant rate limits and ideally isolation for your largest customers.
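The single-flight idea from the cache stampede item can be sketched in a few dozen lines: the first caller for a key does the work, and everyone else who shows up while it's running waits and shares the result. This is a thread-based, single-process sketch with hypothetical names. Across multiple app servers you'd need a shared lock or a different technique, such as probabilistic early expiration.

```python
import threading


class _Call:
    """Bookkeeping for one in-flight computation."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Collapse concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call

        if leader:
            # Only the first caller runs fn; the result is shared.
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()  # wake every waiting follower
        else:
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result
```

Wrap the "miss" branch of your cache lookup in do(key, load_from_database) and a thousand simultaneous misses turn into one database query. Failures are shared too, so followers see the same exception as the leader.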

Security at 100k:

  • Continuous monitoring. SIEM or at least centralized auth-event analysis. You will be targeted by automated attacks every minute.
  • Bug bounty. Even a small one ($100-$2k) brings outside eyes. HackerOne or Intigriti.
  • Formal security program. SOC 2 if you sell to enterprises, ISO 27001 if you sell internationally, both if you do both.
  • Real incident response. Not just runbooks — a defined on-call rotation, a status page, a designated communications lead.

Operational:

  • Multi-region failover if you have any latency-sensitive geography
  • Chaos testing — deliberately break things in staging, verify the system recovers
  • Cost monitoring — at 100k users you can quietly burn $100k/month on poorly-tuned infra without noticing

The pattern

What you'll notice across all four stages: most of the failures are not exotic. They're the same five categories — bad queries, missing caches, unbounded growth, sync where it should be async, and missing rate limits. The exact symptom changes with scale, but the underlying cause is the same.

Fix the boring stuff before it becomes the urgent stuff. That's the entire game.

If you're around 1,000-10,000 users and want a one-week read on which of these you're already exposed to, that's exactly the shape of our security audits and performance optimization work. Faster than you'd guess. Cheaper than the incident.
