There's no stable release yet, but you can get everything currently running on blerg.dominionofawesome.com by cloning the git repository at http://git.bytex64.net/blerg.git.
Blërg has varying requirements depending on how you want to run it — as a standalone HTTP server, or as a CGI. You will need:
As a standalone HTTP server, you will also need:
Or, as a CGI, you will need:
There is now an experimental autoconf build system. If you run add-autoconf, it'll do the magic and create a configure script that'll do the familiar things. If I ever get around to distributing source packages, you should find that this has already been done.
If you'd rather stick with the manual system, you should edit libs.mk and put in the paths where you can find headers and libraries for the above requirements.
Also, further apologies to BSD folks — I've probably committed several unconscious Linux-isms. It would not surprise me if the makefile refuses to work with BSD make, or if it fails to compile even with gmake. If you have patches or suggestions on how to make Blërg more portable, I'd be happy to hear them.
At this point, it should be gravy. Type 'make' and in a few seconds, you should have blerg.httpd, blerg.cgi, rss.cgi, and blergtool. Each of those can be made individually as well, if you, for example, don't want to install the prerequisites for blerg.httpd or blerg.cgi.
While it's not strictly required, Blërg will be easier to set up if you configure it to work from the root of your website. For this reason, it's better to use a subdomain (i.e., blerg.yoursite.com is easier than yoursite.com/blerg/). If you do want to put it in a subdirectory, you will have to modify www/js/blerg.js and change baseURL at the top as well as a number of other self-references in that file and www/index.html. The CGI version should work fine this way, but the HTTP version will require the request to be rewritten, as it expects to be serving from the root.
You cannot serve the database and client from different domains (i.e., yoursite.com vs othersite.net, or even foo.yoursite.com and bar.yoursite.com). This is a requirement of the web browser — the same origin policy will not allow an AJAX request to travel across domains.
Right now, blerg.httpd doesn't serve any static assets, so you're going to have to put it behind a real webserver like apache, lighttpd, nginx, or similar. Set the document root to the www directory, then proxy /info, /create, /login, /logout, /get, /tag, and /put to blerg.httpd. You can change the port blerg.httpd listens on in config.h.
Copy the files in www/ to the root of your web server. Copy blerg.cgi to your web server. Included in www-configs/ is a .htaccess file for Apache that will rewrite the URLs. If you need to call the CGI something other than blerg.cgi, the .htaccess file will need to be modified.
There is an optional RSS CGI (rss.cgi) that will serve RSS feeds for users. Install this like blerg.cgi above.
Blërg's API was designed to be as simple as possible. Data sent from the client is POSTed with the application/x-www-form-urlencoded encoding, and a successful response is always JSON. The API endpoints will be described as though the server were serving requests from the root of the website.
On failure, all API calls return either a standard HTTP error response, like 404 Not Found if a record or user doesn't exist, or a 200 response with a 'JSON failure', which will look like this:
{"status": "failure"}
Blërg doesn't currently explain why there is a failure, and I'm not sure it ever will.
On success, you'll either get some JSON relating to your request (for /get, /tag, or /info), or a 'JSON success' response (for /create, /put, /login, or /logout), which looks like this:
{"status": "success"}
For the CGI backend, you may get a 500 error if something goes wrong. For the HTTP backend, you'll get nothing (since it will have crashed), or maybe a 502 Bad Gateway if you have it behind another web server.
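A client has to tell these cases apart, so here is a rough Python sketch of one way to do it. The function name and the tuple it returns are made up for illustration; only the status codes and the two JSON shapes come from the description above.

```python
import json


def interpret_response(http_status, body):
    """Classify an API response.

    Returns one of:
      ("http_error", status)  for any non-200 response (404, 500, 502, ...)
      ("failure", None)       for the 'JSON failure' object
      ("success", None)       for the 'JSON success' object
      ("data", parsed)        for anything else (e.g. /get, /tag, /info payloads)
    """
    if http_status != 200:
        return ("http_error", http_status)
    parsed = json.loads(body)
    if isinstance(parsed, dict) and parsed.get("status") == "failure":
        return ("failure", None)
    if isinstance(parsed, dict) and parsed.get("status") == "success":
        return ("success", None)
    return ("data", parsed)
```

Since the server gives no reason for a failure, the sketch doesn't try to invent one.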
All usernames must be 32 characters or less. Usernames must contain only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-). Passwords can be at most 64 bytes, and have no limits on characters (but beware: if you have a null in the middle, it will stop checking there because I use strncmp(3) to compare).
Tags must be 64 characters or less, and can contain only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-).
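If you want to check these limits on the client before bothering the server, something like the following Python sketch would do. The function names are hypothetical, and two things are assumptions rather than documented behavior: that an empty username or tag is invalid, and that a password containing a null byte is best rejected outright (the server doesn't reject it, it just stops comparing there).

```python
import re

# The allowed alphabet for usernames and tags: 0-9, A-Z, a-z, underscore, hyphen.
NAME_RE = re.compile(r"[0-9A-Za-z_-]+")


def valid_username(name):
    """At most 32 characters from the allowed alphabet (assumed non-empty)."""
    return len(name) <= 32 and NAME_RE.fullmatch(name) is not None


def valid_tag(tag):
    """At most 64 characters from the allowed alphabet (assumed non-empty)."""
    return len(tag) <= 64 and NAME_RE.fullmatch(tag) is not None


def valid_password(password):
    """At most 64 bytes; the limit is in bytes, so encode before measuring.

    Rejecting an embedded null is a client-side precaution, not a server rule.
    """
    raw = password.encode("utf-8")
    return len(raw) <= 64 and b"\x00" not in raw
```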
To create a user, POST to /create with username and password parameters for the new user. The server will respond with JSON failure if the user exists, or if the user can't be created for some other reason. The server will respond with JSON success if the user is created.
POST to /login with the username and password parameters for an existing user. The server will respond with JSON failure if the user does not exist or if the password is incorrect. On success, the server will respond with JSON success, and will set a cookie named 'auth' that must be sent by the client when accessing restricted API functions (/put and /logout).
POST to /logout with username, the user to log out, along with the auth cookie in a Cookie header. The server will respond with JSON failure if the user does not exist or if the auth cookie is bad. The server will respond with JSON success after the user is successfully logged out.
POST to /put with username and data parameters, and an auth cookie. The server will respond with JSON failure if the auth cookie is bad, if the user doesn't exist, or if data contains more than 65535 bytes after URL decoding. The server will respond with JSON success after the record is successfully added.
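As an illustration, here is roughly how a /put request could be assembled with Python's standard library. Nothing is sent; the function only builds the request object. The host name, the function name, and the cookie value are placeholders, and I'm assuming the record text is sent as UTF-8.

```python
from urllib.parse import urlencode
from urllib.request import Request

MAX_DATA_BYTES = 65535  # limit on data after URL decoding, per the text above


def build_put_request(base_url, username, data, auth_cookie):
    """Build (but do not send) a POST to /put.

    base_url    -- e.g. "http://blerg.example.com" (hypothetical host)
    auth_cookie -- the value of the 'auth' cookie set by /login
    """
    if len(data.encode("utf-8")) > MAX_DATA_BYTES:
        raise ValueError("data exceeds 65535 bytes; the server would refuse it")
    body = urlencode({"username": username, "data": data}).encode("ascii")
    return Request(
        base_url + "/put",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Cookie": "auth=" + auth_cookie,
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would perform the request. The other POST endpoints follow the same pattern with different parameters.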
A GET request to /get/(user), where (user) is the user desired, will return the last 50 records for that user in a list of objects. The record objects look like this:
{ "record":"0", "timestamp":1294309438, "data":"eatin a taco on fifth street" }
record is the record number, timestamp is the UNIX epoch timestamp (i.e., the number of seconds since Jan 1 1970 00:00:00 GMT), and data is the content of the record. The record number is sent as a string because while Blërg supports record numbers up to 2^64 - 1, Javascript uses floating point for all its numbers, and can only support integers without truncation up to 2^53. This difference is largely academic, but I didn't want this problem to sneak up on anyone who is more insane than I am. :]
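To make the truncation concrete, here is a small Python sketch. Python integers are arbitrary precision, so parsing the string is lossless. The second helper round-trips a number through a double (the type Javascript uses for all numbers) to show where exactness ends. Both function names are made up for this example.

```python
def record_number(record):
    """Parse the string-encoded record number into an exact integer."""
    return int(record["record"])


def survives_double(n):
    """True if n is unchanged after a round trip through a 64-bit float."""
    return int(float(n)) == n


# 2**53 itself is representable, but its neighbour 2**53 + 1 is not,
# and neither is the largest record number Blërg supports.
assert survives_double(2**53)
assert not survives_double(2**53 + 1)
assert not survives_double(2**64 - 1)
assert record_number({"record": str(2**64 - 1)}) == 2**64 - 1
```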
The second form, /get/(user)/(start record)-(end record), retrieves a specific range of records, from (start record) to (end record) inclusive. You can retrieve at most 100 records this way. If (end record) - (start record) specifies more than 100 records, or if the range specifies invalid records, or if the end record is before the start record, the server will respond with JSON failure.
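A client can catch most bad ranges before asking. The Python sketch below builds the path for a range request. The helper name is hypothetical, and because the range is inclusive I've read "at most 100 records" conservatively as end - start + 1 <= 100; the server's exact boundary may differ by one.

```python
MAX_RANGE = 100  # most records retrievable in one range request


def get_range_path(user, start, end):
    """Return the /get path for records start..end inclusive."""
    if start < 0 or end < start:
        raise ValueError("end record must not come before start record")
    if end - start + 1 > MAX_RANGE:
        raise ValueError("range covers more than 100 records")
    return "/get/{}/{}-{}".format(user, start, end)
```

This can't detect ranges that point past the user's last record; for that you would check the record count from /info first.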
A GET request to /info/(user) will return a JSON object with information about the user (currently only the number of records). The info object looks like this:
{ "record_count": "544" }
Again, the record count is sent as a string for 64-bit safety.
A GET request to this endpoint will return the last 50 records associated with the given tag. The first character is either # or H for hashtags, or @ for mentions (I call them ref tags). You should URL encode the # or @, lest some servers complain at you. The H alias for # was created because Apache helpfully strips the fragment of a URL (everything from the # to the end) before handing it off to the CGI, even if the hash is URL encoded. The record objects also contain an extra author field, like so:
{ "author":"Jon", "record":"57", "timestamp":1294555793, "data":"I'm taking #garfield to the vet." }
There is currently no support for getting more than 50 tags, but /tag will probably mutate to work like /get.
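Here is a short Python sketch of the URL encoding step. It assumes the request path is /tag/ followed by the prefixed tag name, which is my guess at the layout; the helper name is also made up.

```python
from urllib.parse import quote


def tag_path(kind, name):
    """Build a tag lookup path; kind is '#', 'H' (alias for '#'), or '@'."""
    if kind not in ("#", "H", "@"):
        raise ValueError("kind must be '#', 'H', or '@'")
    # safe="" makes quote() percent-encode '#' and '@' as well.
    return "/tag/" + quote(kind + name, safe="")


assert tag_path("#", "garfield") == "/tag/%23garfield"
assert tag_path("@", "Jon") == "/tag/%40Jon"
assert tag_path("H", "garfield") == "/tag/Hgarfield"
```

Behind Apache with the CGI backend, the H form is the one to use, for the reason given above.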
POST to /subscribe/(user) with a username parameter and an auth cookie, where (user) is the user whose updates you wish to subscribe to. The server will respond with JSON failure if the auth cookie is bad or if the user doesn't exist. The server will respond with JSON success after the subscription is successfully registered.
Identical to /subscribe, but removes the subscription.
POST to /feed, with a username parameter and an auth cookie. The server will respond with a JSON list of the last 50 updates from all subscribed users, in reverse chronological order. Fetching /feed resets the new message count returned from /feedinfo.
NOTE: subscription notifications are only stored while subscriptions are active. Any records inserted before or after a subscription is active will not show up in /feed.
POST to /feedinfo with a username parameter and an auth cookie to get general information about your subscribed feeds. Currently, this only tells you how many new records there are since the last time /feed was fetched. The server will respond with a JSON object:
{"new":3}
POST to /feedinfo/(user) with a username parameter and an auth cookie, where (user) is a user whose subscription status you are interested in. The server will respond with a simple JSON object:
{"subscribed":true}
The value of "subscribed" will be either true or false depending on the subscription status.
Blërg was created as the result of a thought experiment: "What if Twitter didn't need thousands of servers? What if its millions of users could be handled by a single highly efficient server?" This is probably an unreachable goal due to the sheer amount of I/O, but we can certainly try to do better. Blërg was thus designed as a system with very simple requirements:
And to further simplify, I didn't bother handling deletes, full text search, or more complicated tag searches. Blërg only does the basics.
Classical model |
---|
Client App HTML/Javascript |
Webserver Apache, lighttpd, nginx, etc. |
Server App Python, Perl, Ruby, etc. |
Database MySQL, PostgreSQL, MongoDB, CouchDB, etc. |
Modern web applications have at least a four-layer approach. You have the client-side browser app, the web server, the server-side application, and the database. Your data goes through a lot of layers before it actually resides on disk somewhere (or, as they're calling it these days, "The Cloud" *waves hands*). Each of those layers requires some amount of computing resources, so to increase throughput, we must make the layers more efficient, or reduce the number of layers.
Blërg model |
---|
Blërg Client App HTML/Javascript |
Blërg Database Fuckin' hardcore C and shit |
Blërg does both by smashing the last two or three layers into one application. Blërg can be run as either a standalone web server, or as a CGI (FastCGI support is planned, but I just don't care right now). Less waste, more throughput. As a consequence of this, the entirety of the application logic that the user sees is implemented in the client app in Javascript. That's why all the URLs have #'s: the page is loaded once and switched on the fly to show different views, further reducing load on the server. Even parsing hash tags and URLs is done in client JS.
The API is simple and pragmatic. It's not entirely RESTful, but is rather designed to work well with web-based front-ends. Client data is always POSTed with the usual application/x-www-form-urlencoded encoding, and server data is always returned in JSON format.
The HTTP interface to the database idea has already been done by CouchDB, though I didn't know that until after I wrote Blërg. :)
I was impressed by varnish's design, so I decided early in the design process that I'd try out mmapped I/O. Each user in Blërg has their own database, which consists of a metadata file, and one or more data and index files. The data and index files are memory mapped, which hopefully makes things more efficient by letting the OS handle when to read from disk (or maybe not; I haven't benchmarked it). The index files are preallocated because I believe it's more efficient than writing to them 40 bytes at a time as records are added. The database's limits are reasonable:
maximum record size | 65535 bytes |
maximum number of records per database | 2^64 - 1 |
maximum number of tags per record | 1024 |
Record Index Structure |
---|
offset (32-bit integer) |
length (16-bit integer) |
flags (16-bit integer) |
timestamp (32-bit integer) |
A record is stored by first appending the data to the data file, then writing an entry in the index file containing the offset and length of the data, as well as the timestamp. Since each index entry is fixed length, we can find the index entry simply by multiplying the record number we want by the size of the index entry. Upshot: constant-time random-access reads and constant-time writes. As an added bonus, because we're using append-only files, we get lockless reads.
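The scheme is easy to mimic in a few lines of Python. This is only a sketch of the idea, not the real on-disk format: it packs the four fields from the table above in little-endian order with no padding, uses bytearrays in place of memory-mapped files, and grows the index on demand instead of preallocating it. All names are invented for the example.

```python
import struct

# offset (32-bit), length (16-bit), flags (16-bit), timestamp (32-bit)
INDEX_ENTRY = struct.Struct("<IHHI")


def entry_position(record_number):
    """Byte position of an index entry: record number times entry size."""
    return record_number * INDEX_ENTRY.size


def append_record(data_file, index_file, payload, timestamp):
    """Append payload to the data file, then its entry to the index."""
    offset = len(data_file)
    data_file += payload
    index_file += INDEX_ENTRY.pack(offset, len(payload), 0, timestamp)
    return len(index_file) // INDEX_ENTRY.size - 1  # the new record number


def read_record(data_file, index_file, record_number):
    """Constant-time lookup: one index read, one data slice."""
    offset, length, _flags, timestamp = INDEX_ENTRY.unpack_from(
        index_file, entry_position(record_number)
    )
    return bytes(data_file[offset:offset + length]), timestamp
```

Neither function ever scans the files, which is where the constant-time claim comes from.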
Tag Structure |
---|
username (32 bytes) |
record number (64-bit integer) |
Tags are handled by a separate set of indices, one per tag. When a record is added, it is scanned for tags, then entries are appended to each tag index for the tags found. Each index record simply stores the user and record number. Tags are searched by opening the tag file, reading the last 50 entries or so, and then reading all the records listed. Voila, fast tag lookups.
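A tag lookup then amounts to reading fixed-size entries off the end of a file. The Python sketch below assumes the layout in the table above (a 32-byte null-padded username followed by a 64-bit record number, little-endian, no padding) and takes the index as a bytes object. The real files may be laid out differently, and the function names are hypothetical.

```python
import struct

# username (32 bytes, null-padded), record number (64-bit)
TAG_ENTRY = struct.Struct("<32sQ")


def pack_tag_entry(username, record_number):
    """Encode one tag index entry."""
    return TAG_ENTRY.pack(username.encode("ascii"), record_number)


def last_tag_entries(index_bytes, count=50):
    """Return the last `count` (username, record number) pairs, oldest first."""
    total = len(index_bytes) // TAG_ENTRY.size
    first = max(0, total - count)
    entries = []
    for i in range(first, total):
        raw_name, record_number = TAG_ENTRY.unpack_from(
            index_bytes, i * TAG_ENTRY.size
        )
        entries.append((raw_name.rstrip(b"\x00").decode("ascii"), record_number))
    return entries
```

Each pair returned would then be resolved with an ordinary record read from that user's database.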
At this point, you're probably thinking, "Is that it?" Yep, that's it. Blërg isn't revolutionary, it's just a system whose requirements were pared down until the implementation could be made dead simple.
Also, keeping with the style of modern object databases, I haven't implemented any data safety (har har). Blërg does not sync anything to disk before returning success. This should make Blërg extremely fast, and totally unreliable in a crash. But that's the way you want it, right? :]
When I first started thinking about the idea of subscriptions, I immediately came up with the naïve solution: keep a list of users to which users are subscribed, then when you want to get updates, iterate over the list and find the last entries for each user. And that would work, but it's kind of costly in terms of disk I/O. I have to visit each user in the list, retrieve their last few entries, and store them somewhere else to be sorted later. And worse, that computation has to be done every time a user checks their feed. As the number of users and subscriptions grows, that will become a problem.
So instead, I thought about it the other way around. Instead of doing all the work when the request is received, Blërg tries to do as much as possible by "pushing" updates to subscribed users. You can think of it kind of like a mail system. When a user posts new content, a notification is "sent" out to each of that user's subscribers. Later, when the subscribers want to see what's new, they simply check their mailbox. Checking your mailbox is usually a lot more efficient than going around and checking everyone's records yourself, even with the overhead of the "mailman."
The "mailbox" is a subscription index, which is identical to a tag index, but is a per-user construct. When a user posts a new record, a subscription index record is written for every subscriber. It's a similar amount of I/O as the naïve version above, but the important difference is that it's only done once. Retrieving records for accounts you're subscribed to is then as simple as reading your subscription index and reading the associated records. This is hopefully less I/O than the naïve version, since you're reading, at most, as many accounts as you have records in the last N entries of your subscription index, instead of all of them. And as an added bonus, since subscription index records are added as posts are created, the subscription index is automatically sorted by time! To support this "mail" architecture, we also keep a list of subscribers and subscrib...ees in each account.
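The mailbox idea can be shown with an in-memory toy in Python. It isn't how Blërg stores anything (the real indices live in files); the class and method names are invented, and it only demonstrates that the work happens at post time and that the mailbox comes out sorted for free.

```python
from collections import defaultdict


class Feeds:
    """Toy model of push-style subscriptions."""

    def __init__(self):
        self.subscribers = defaultdict(set)    # author -> readers
        self.mailboxes = defaultdict(list)     # reader -> [(author, record)]
        self.record_counts = defaultdict(int)  # author -> next record number

    def subscribe(self, reader, author):
        self.subscribers[author].add(reader)

    def unsubscribe(self, reader, author):
        self.subscribers[author].discard(reader)

    def post(self, author):
        """Add a record and deliver a notification to every subscriber."""
        record_number = self.record_counts[author]
        self.record_counts[author] += 1
        for reader in self.subscribers[author]:
            self.mailboxes[reader].append((author, record_number))
        return record_number

    def feed(self, reader, count=50):
        """Last `count` notifications, newest first. No sorting needed."""
        return list(reversed(self.mailboxes[reader][-count:]))
```

Because notifications are only written while a subscription is active, posts made before subscribing or after unsubscribing never reach the mailbox, which matches the note in the /feed description.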
Blërg probably doesn't actually work like Twitter because I've never actually had a Twitter account.
I couldn't find a really good fast HTTP server library. Libmicrohttpd is small, but it's focused on embedded applications, so it often eschews speed for small memory footprint. This is especially apparent when you watch it chew through a POST request 300 bytes at a time even though you've specified a buffer size of 256K. blerg.httpd is still pretty fast this way (on my 2GHz Opteron 246, siege says it serves a 690-byte /get request at about 945 transactions per second, average response time 0.05 seconds, with 100 concurrent accesses), but a fast HTTP server implementation could knock this out of the park.
Libmicrohttpd is also really difficult to work with. If you look at the code, http_blerg.c is about 70% longer than cgi_blerg.c simply because of all the iterator hoops I had to jump through to process POST requests. And if you can believe it, I wrote http_blerg.c first. If I'd done it the other way around, I probably would have given up on libmicrohttpd. :-/
The data structures written to disk are dependent on the size and endianness of the primitive data types on your architecture and OS. This means that the databases are not portable. A dump/import tool is probably the easiest way to handle this.
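The endianness half of the problem is easy to see with Python's struct module: the same 32-bit value produces different bytes depending on byte order, so a file written on one kind of machine reads back as garbage on the other. The helper below is purely illustrative.

```python
import struct


def encode_u32(value, big_endian):
    """Pack a 32-bit unsigned integer in the requested byte order."""
    return struct.pack(">I" if big_endian else "<I", value)


little = encode_u32(1, big_endian=False)
big = encode_u32(1, big_endian=True)
assert little == b"\x01\x00\x00\x00"
assert big == b"\x00\x00\x00\x01"

# Reading little-endian bytes as big-endian gives a very different number.
assert struct.unpack(">I", little)[0] == 16777216
```

A dump/import tool would sidestep this by writing a fixed, documented byte order or a text format.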
I do want to make a FastCGI version eventually, and this will probably be a rather simple modification of cgi_blerg.
Implementing deletes will be... interesting. There is room in the record index for a 'deleted' flag, but the problem is deleting any tags referenced in the data. This requires rescanning the record content and putting a 'deleted' flag in the tag indices. This will not be pretty, so I'm just going to ignore it and hope nobody makes any mistakes. ;]
Tag indices can grow arbitrarily large, which will cause problems for 32-bit machines around the 3GB mark. Still, that's something like 80 million tags, so maybe it's not something to worry about.
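The 80 million figure checks out if you assume each tag index entry is 40 bytes (a 32-byte username plus a 64-bit record number, with no padding) and take 3GB to mean 3 * 2^30 bytes. A quick Python check, with a made-up helper name:

```python
def entries_before_limit(limit_bytes, entry_bytes=32 + 8):
    """How many whole tag index entries fit below a size limit."""
    return limit_bytes // entry_bytes


three_gb = 3 * 2**30
assert entries_before_limit(three_gb) == 80530636  # roughly 80 million
```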
The API currently requires the client to transmit the user's password in the clear. A digest-based authentication scheme would be better, though for real security, the app should run over HTTPS.