Distribution of size of SSB feeds
Scroll down for graphs.
The latest sbot RPC command lists the latest sequence number and timestamp of each feed your sbot has replicated. The latest sequence number of a feed is the number of messages in that feed. We can use this data to look at the distribution of messages over feeds:
sbotc latest | jq -r .sequence | sort -rn > counts
In my data there are 9286 feeds, containing 617362 messages:
$ wc -l counts
9286 counts
$ awk '{sum+=$1} END{print sum}' counts
617362
$ flumecat -z ~/.ssb/flume/log.offset | grep -zc .
617362
The longest feed has 20080 messages. The most common feed length is 3, occurring in 1072 feeds:
$ head -1 counts
20080
$ uniq -c counts | sort -rn | head -1
1072 3
The median count of messages per feed (rounded down) is 9.
The shorter half of the feeds (4643 feeds) contain 9 or fewer messages each (totaling 20092 messages, or 3.25% of all messages).
The longer half of the feeds (4643 feeds) contain 9 or more messages each (totaling 597270 messages, or 96.75% of all messages):
$ bc -lq
9286/2
4643.00000000000000000000
$ sed -n 4643p counts
9
$ head -4643 counts | awk '{sum+=$1} END{print sum}'
597270
$ tail -4643 counts | awk '{sum+=$1} END{print sum}'
20092
$ bc -lq
597270+20092
617362
20092/617362 * 100
3.25449250196805115900
597270/617362 * 100
96.74550749803194884000
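The split above can also be computed in a single pass. This is only a sketch: half_split is a hypothetical helper name (not part of any tool used here), and it assumes the input is one count per line, sorted in descending order like the counts file. It reports the value at the midpoint line (the same line that sed -n 4643p picks above) and the message totals of the longer and shorter halves. With an odd number of feeds, the middle feed is left out of both halves.

```shell
# Hypothetical helper: summarize the longer and shorter halves of a
# descending-sorted list of per-feed message counts (one integer per line).
# Reads the named file, or stdin if no file is given.
half_split() {
  awk '
    { v[NR] = $1; total += $1 }          # keep every count, track the grand total
    END {
      if (NR == 0) exit 1                # nothing to summarize
      half = int(NR / 2)                 # number of feeds in each half
      for (i = 1; i <= half; i++) top += v[i]
      for (i = NR - half + 1; i <= NR; i++) bottom += v[i]
      printf "feeds=%d half=%d mid=%d top=%d (%.2f%%) bottom=%d (%.2f%%)\n",
        NR, half, v[half], top, 100 * top / total, bottom, 100 * bottom / total
    }
  ' "$@"
}
```

Running half_split counts should give the same sums as the head and tail commands above, without needing bc for the percentages.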
The longest 10 feeds (0.11%) each contain at least 7112 messages, and in total account for 19.48% (120255) of messages:
$ head -1 counts
20080
$ sed -n 10p counts
7112
$ head -10 counts | awk '{sum+=$1} END{print sum}'
120255
$ bc -lq
10/9286 * 100
.10768899418479431400
120255/617362 * 100
19.47884709457336214400
The top 78 feeds (0.84% of feeds) together account for just over 50% of messages (more than 308681):
$ awk '{sum+=$1} sum > 617362/2 {print NR; exit}' counts
78
$ bc -lq
78/9286 * 100
.83997415464139564900
617362/2
308681.00000000000000000000
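The awk one-liner above hard-codes the total of 617362. A variation that works out the total itself might look like the following sketch. feeds_for_share is a hypothetical helper name, and the input is again assumed to be one count per line, sorted in descending order. It holds all counts in memory, which is fine for a few thousand feeds.

```shell
# Hypothetical helper: how many of the longest feeds are needed before their
# cumulative message count exceeds a given fraction of all messages.
# Usage: feeds_for_share FRACTION [FILE]   (FILE sorted descending; stdin if omitted)
feeds_for_share() {
  frac=$1; shift
  awk -v frac="$frac" '
    { v[NR] = $1; total += $1 }          # keep every count, track the grand total
    END {
      for (i = 1; i <= NR; i++) {
        sum += v[i]
        if (sum > total * frac) {        # same strict test as the one-liner above
          printf "%d feeds (%.2f%% of feeds) hold %d of %d messages\n", i, 100 * i / NR, sum, total
          exit
        }
      }
    }
  ' "$@"
}
```

For example, feeds_for_share 0.5 counts should report the same feed count as the command above, along with the exact cumulative message count at that point.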
The top 1.00% of feeds (93 feeds) each contain at least 918 messages, and in total contain 52.52% (324264) of messages:
$ bc -lq
9286 * 0.01
92.86
93/9286 * 100
1.00150764591858712000
$ sed -n 93p counts
918
$ head -93 counts | awk '{sum+=$1} END{print sum}'
324264
$ bc -lq
324264/617362 * 100
52.52412684940116171700
The longest 10.00% of feeds (929 feeds) each contain at least 74 messages, and in total contain 81.61% (503853) of messages:
$ bc -lq
9286 * 0.10
928.60
929/9286 * 100
10.00430755976739177200
$ sed -n 929p counts
74
$ head -929 counts | awk '{sum+=$1} END{print sum}'
503853
$ bc -lq
503853/617362 * 100
81.61386674268905439500
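The 1% and 10% calculations follow the same steps, so they can be wrapped up in one function. This is a sketch under some assumptions: top_percent is a hypothetical helper name, the input is one count per line sorted in descending order, and the number of feeds is rounded up (which agrees with 92.86 becoming 93 and 928.60 becoming 929 above, though rounding to nearest would give the same results there).

```shell
# Hypothetical helper: summarize the longest PCT percent of feeds.
# Usage: top_percent PCT [FILE]   (FILE sorted descending; stdin if omitted)
top_percent() {
  pct=$1; shift
  awk -v pct="$pct" '
    { v[NR] = $1; total += $1 }          # keep every count, track the grand total
    END {
      if (NR == 0) exit 1                # nothing to summarize
      n = NR * pct / 100
      k = int(n); if (k < n) k++         # round the feed count up (assumed rounding)
      for (i = 1; i <= k; i++) sum += v[i]
      printf "top %d feeds: at least %d messages each, %d total (%.2f%% of messages)\n", k, v[k], sum, 100 * sum / total
    }
  ' "$@"
}
```

With this, top_percent 1 counts and top_percent 10 counts should reproduce the two sets of figures above, and other cutoffs are easy to try.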
Here is a graph of the longest 10% of feeds:
pngspark -s 0 -h 512 counts.png < counts
convert counts.png -crop 10%x100%+0+0 feed-message-counts-10pct.png
Here is a graph of the length of all feeds, on a log scale:
awk '{print log($1)}' counts | pngspark -s 0 -h 1536 counts-log.png
convert counts-log.png -geometry 960x feed-message-counts-log.png
Links for tools used:
sbotc, pngspark, flumecat, convert
Note: there seems to be a bug in pngspark where it doesn't display the lowest value on the right (which is nonzero).
I'm sure people more practiced with statistics could produce better, more informative graphs and tables. This is just a start on some of the data you can get from scuttlebot and what we can learn from it.
Other things to do:
- Weight messages by their size in bytes
- Study length of feeds in terms of time