# ATproto PDS indexer
> [!IMPORTANT]
> Any use of this code and the data obtained with its help must adhere to Bluesky's Terms of Service and Community Guidelines. In particular, you are not allowed to distribute any of the data without explicit permission from the user that piece of data belongs to.
>
> We also do not condone any use of the data obtained with this code for the purposes of:
>
> - training ML/AI models without explicit consent of the users who own the data
> - aiding any kind of harassment campaign against anyone
This is a bunch of code that can download all of Bluesky into a giant table in PostgreSQL.
The structure of that table is roughly `(repo, collection, rkey) -> JSON`, and it is a good idea to partition it by collection.
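The layout described above can be sketched as DDL. This is only an illustration: the table and column names below are hypothetical and may not match the schema this code actually creates.

```sql
-- Hypothetical sketch of the "giant table"; names are illustrative,
-- not the actual schema created by this code.
CREATE TABLE records (
    repo       text  NOT NULL,  -- DID of the account's repo
    collection text  NOT NULL,  -- record type, e.g. 'app.bsky.feed.post'
    rkey       text  NOT NULL,  -- record key within the collection
    content    jsonb NOT NULL,  -- the record itself, as JSON
    PRIMARY KEY (repo, collection, rkey)
) PARTITION BY LIST (collection);
```

Note that with list partitioning, the partition key (`collection`) must be part of the primary key, which the `(repo, collection, rkey)` layout already satisfies.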
NOTE: all of this is valid as of December 2024, when Bluesky had ~24M accounts, ~4.7B records in total, and an average daily peak of ~1000 commits/s.

With a SATA SSD dedicated to ScyllaDB, it can handle about 6000 commits/s from the firehose. The actual number you get might be lower if your CPU is not fast enough.
### Lister

Once a day, gets a list of all repos from all known PDSs and adds any missing ones to the database.
### Consumer

Connects to the firehose of each PDS and stores all received records in the database. If `CONSUMER_RELAYS` is specified, it will also add to the database any new PDSs that have records coming through a relay.
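Relays can be supplied via the `docker-compose.override.yml` file mentioned in the setup steps. A minimal sketch, with the caveat that the service name `consumer` and the relay URL below are assumptions, not values taken from this repo:

```yaml
# Hypothetical docker-compose.override.yml fragment.
# The service name "consumer" and the relay URL are assumptions;
# adjust them to match your actual docker-compose.yml.
services:
  consumer:
    environment:
      CONSUMER_RELAYS: "https://bsky.network"
```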
### Record indexer

Goes over all repos that might have missing data, gets a full checkout from the PDS, and adds all missing records to the database.
### PLC mirror

A `/${did}` request returns a DID document.

## Setup

* Copy `example.env` to `.env` and edit it to your liking.
  * `POSTGRES_PASSWORD` can be anything; it will be used on the first start of the `postgres` container to initialize the database.
* Optionally, copy `docker-compose.override.yml.example` to `docker-compose.override.yml` to change some parts of `docker-compose.yml` without actually editing it (and introducing the possibility of merge conflicts later on).
* Run `make init-db`.
* Add `CONSUMER_RELAYS` in `docker-compose.override.yml`.
* Run `make up`.

## Operation

* `make status` - shows container status and resource usage
* `make psql` - starts an SQL shell inside the `postgres` container
* `make logs` - streams container logs into your terminal
* `make sqltop` - shows currently running queries
* `make sqldu` - shows disk space usage for each table and index

Record indexer exposes a simple HTTP handler that lets you resize its worker pool at runtime:
```sh
curl -s 'http://localhost:11003/pool/resize?size=10'
```
## Partitioning by collection

With partitioning by collection, you can have separate indexes for each record type. Any kind of heavy processing on a particular record type will also be faster, because all of those records will be in a separate table and PostgreSQL will read them sequentially instead of checking the collection column for each row.
You can do the partitioning at any point, but the more data you already have in the database, the longer it will take.
Before doing this, you need to run lister at least once in order to create the tables (`make init-db` does this for you as well).
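As an illustration, splitting off one collection could look like the statements below. This assumes the table is named `records` and was declared with `PARTITION BY LIST (collection)`; both the table and column names here are hypothetical, not taken from this repo's actual schema.

```sql
-- Hypothetical: table/column names are assumptions, not this repo's actual schema.
-- Dedicated partition for posts.
CREATE TABLE records_post PARTITION OF records
    FOR VALUES IN ('app.bsky.feed.post');

-- Catch-all partition for every other collection.
CREATE TABLE records_default PARTITION OF records DEFAULT;

-- An index created on one partition covers only that record type.
CREATE INDEX ON records_post (repo, rkey);
```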
See the `postgres.psql.migrations` dir for any additional migrations you might be interested in.