Blërg Documentation

Blërg can be run as either a standalone HTTP server, or a CGI. Blërg is written in pure C.

Running Blërg

Getting the source


There's no stable release yet, but you can get everything currently running on blerg.dominionofawesome.com by cloning the git repository at http://git.bytex64.net/blerg.git.
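For example, to grab the source and enter the tree (the prompt is illustrative):

user@devhost:~$ git clone http://git.bytex64.net/blerg.git
user@devhost:~$ cd blerg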

Configuring


There is now an experimental autoconf build system. If you run add-autoconf, it'll do the magic and create a configure script that'll do the familiar things. If I ever get around to distributing source packages, you should find that this has already been done.
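A minimal sketch of that flow, assuming add-autoconf sits in the top of the source tree and takes no arguments:

user@devhost:~/blerg$ ./add-autoconf   # generates the configure script (invocation assumed)
user@devhost:~/blerg$ ./configure
user@devhost:~/blerg$ make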

If you'd rather stick with the manual system, you should edit libs.mk and put in the paths where you can find headers and libraries for the above requirements.

Also, further apologies to BSD folks — I've probably committed several unconscious Linux-isms. It would not surprise me if the makefile refuses to work with BSD make, or if it fails to compile even with gmake. If you have patches or suggestions on how to make Blërg more portable, I'd be happy to hear them.

Building

At this point, it should be gravy. Type 'make' and in a few seconds, you should have blerg.httpd, blerg.cgi, rss.cgi, and blergtool. Each of those can be made individually as well, if you, for example, don't want to install the prerequisites for blerg.httpd or blerg.cgi.
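For instance, assuming each binary is also a make target of the same name:

user@devhost:~/blerg$ make blergtool blerg.cgi   # target names assumed; skips blerg.httpd and its prerequisites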

NOTE: blerg.httpd is deprecated and will not be updated with new features.

Installing


While it's not strictly required, Blërg will be easier to set up if you configure it to work from the root of your website. For this reason, it's better to use a subdomain (i.e., blerg.yoursite.com is easier than yoursite.com/blerg/). If you do want to put it in a subdirectory, you will have to modify www/js/blerg.js and change baseURL at the top, as well as a number of other self-references in that file and www/index.html.

You cannot serve the database and client from different domains (i.e., yoursite.com vs othersite.net, or even foo.yoursite.com and bar.yoursite.com). This is a requirement of the web browser — the same origin policy will not allow an AJAX request to travel across domains (though you can probably get around it these days with Cross-Origin Resource Sharing).

For straight CGI with Apache

Copy the files in www/ to the root of your web server. Copy blerg.cgi to your web server. Included in www-configs/ is a .htaccess file for Apache that will rewrite the URLs. If you need to call the CGI something other than blerg.cgi, the .htaccess file will need to be modified.
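A sketch of that, assuming an Apache document root of /var/www/html (your paths will differ):

user@devhost:~/blerg$ cp -r www/* /var/www/html/
user@devhost:~/blerg$ cp blerg.cgi /var/www/html/
user@devhost:~/blerg$ cp www-configs/.htaccess /var/www/html/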

For nginx

Nginx can't run CGI directly, and there's currently no FastCGI version of Blërg, so you will have to run it under some kind of CGI-to-FastCGI gateway, like the one described on the nginx wiki. This pretty much destroys the performance of Blërg, but it's all we've got right now.
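As a hypothetical example (not necessarily the wiki's recipe), fcgiwrap started via spawn-fcgi can act as that gateway, with nginx speaking FastCGI to its socket:

user@devhost:~$ spawn-fcgi -s /var/run/fcgiwrap.sock -- /usr/sbin/fcgiwrap   # socket path and fcgiwrap location are assumptions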


The extra RSS CGI


There is an optional RSS CGI (rss.cgi) that will serve RSS feeds for users. Install this like blerg.cgi above. As of 1.9.0, this is a Perl FastCGI script, so you will have to make sure the Perl libraries are available to it. A good way of doing that is to install to an environment directory, as described below.


Installing to an environment directory


The Makefile has support for installing Blërg into a directory that includes tools, libraries, and configuration snippets for shell and web servers. Use it as make install-environment ENV_DIR=<directory>. Under <directory>/etc will be a shell script that sets environment variables, and configuration snippets for nginx and Apache to do the same. This should make it somewhat easier to use Blërg in a self-contained way.


For example, this will install Blërg to an environment directory inside your home directory:


user@devhost:~/blerg$ make install-environment ENV_DIR=$HOME/blerg-env
...
user@devhost:~/blerg$ . ~/blerg-env/etc/env.sh

Then, you will be able to run tools like blergtool, and it will operate on data inside ~/blerg-env/data. Likewise, you can include /home/user/blerg-env/etc/nginx-fastcgi-vars.conf or /home/user/blerg-env/etc/apache-setenv.conf in your webserver to make the CGI/FastCGI scripts do the same thing.

API

On failure, all API calls return either a standard HTTP error response, like 404 Not Found if a record or user doesn't exist, or a 200 response with a 'JSON failure', which will look like this:

{"status": "failure"} @@ -153,8 +211,8 @@ response with some JSON indicating failure, which will look like this: I'm not sure it ever will.

On success, you'll either get some JSON relating to your request (for /get, /tag, or /info), or a 'JSON success' response (for /create, /put, /login, or /logout), which looks like this:

{"status": "success"} @@ -163,47 +221,46 @@ For the HTTP backend, you'll get nothing (since it will have crashed), or maybe a 502 Bad Gateway if you have it behind another web server.

All usernames must be 32 characters or less. Usernames must contain only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-). Passwords can be at most 64 bytes, and have no limits on characters (but beware: if you have a null in the middle, it will stop checking there because I use strncmp(3) to compare).

Tags must be 64 characters or less, and can contain only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-).

/create - create a new user

To create a user, POST to /create with username and password parameters for the new user. The server will respond with JSON failure if the user exists, or if the user can't be created for some other reason. The server will respond with JSON success if the user is created.
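A minimal sketch with curl (hostname and credentials are placeholders):

user@devhost:~$ curl -d 'username=jon' -d 'password=secret1' http://blerg.example.com/create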

/login - log in

POST to /login with the username and password parameters for an existing user. The server will respond with JSON failure if the user does not exist or if the password is incorrect. On success, the server will respond with JSON success, and will set a cookie named 'auth' that must be sent by the client when accessing restricted API functions (/put and /logout).
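For example, logging in and saving the auth cookie for later calls (same placeholders as above):

user@devhost:~$ curl -c cookies.txt -d 'username=jon' -d 'password=secret1' http://blerg.example.com/login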

/logout - log out

POST to /logout with username, the user to log out, along with the auth cookie in a Cookie header. The server will respond with JSON failure if the user does not exist or if the auth cookie is bad. The server will respond with JSON success after the user is successfully logged out.
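Continuing the sketch above, replaying the saved cookie:

user@devhost:~$ curl -b cookies.txt -d 'username=jon' http://blerg.example.com/logout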

/put - add a new record

POST to /put with username and data parameters, and an auth cookie. The server will respond with JSON failure if the auth cookie is bad, if the user doesn't exist, or if data contains more than 65535 bytes after URL decoding. The server will respond with JSON success after the record is successfully added.
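For example, posting a record with the cookie jar from /login (the text is arbitrary; --data-urlencode handles the URL encoding):

user@devhost:~$ curl -b cookies.txt -d 'username=jon' --data-urlencode 'data=Taking #garfield to the vet.' http://blerg.example.com/put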

/get/(user), /get/(user)/(start record)-(end record) - get records for a user

The second form, /get/(user)/(start record)-(end record), retrieves a specific range of records, from (start record) to (end record) inclusive. You can retrieve at most 100 records this way. If (end record) - (start record) specifies more than 100 records, or if the range specifies invalid records, or if the end record is before the start record, the server will respond with JSON failure.
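For example, fetching one user's records 0 through 49 (assuming records are numbered from zero; host and user invented):

user@devhost:~$ curl http://blerg.example.com/get/jon/0-49   # inclusive range, at most 100 records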

/info/(user) - Get information about a user

…was created because Apache helpfully strips the fragment of a URL even if the hash is URL encoded. The record objects also contain an extra author field, like so:
 {
   "author":"Jon",
   "record":"57",
   "timestamp":1294555793,
   "data":"I'm taking #garfield to the vet."
 }
There is currently no support for getting more than 50 tags, but /tag will probably mutate to work like /get.

/subscribe/(user) - Subscribe to a user's updates

POST to /subscribe/(user) with a username parameter and an auth cookie, where (user) is the user whose updates you wish to subscribe to. The server will respond with JSON failure if the auth cookie is bad or if the user doesn't exist. The server will respond with JSON success after the subscription is successfully registered.
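For instance, subscribing jon to kim's updates (names and host are placeholders):

user@devhost:~$ curl -b cookies.txt -d 'username=jon' http://blerg.example.com/subscribe/kim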

/unsubscribe/(user) - Unsubscribe from a user's updates

Identical to /subscribe, but removes the subscription.

/feed - Get updates for subscribed users

POST to /feed with a username parameter and an auth cookie. The server will respond with a JSON list of the last 50 updates from all subscribed users, in reverse chronological order. Fetching /feed resets the new message count returned from /feedinfo.
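For example, with the same cookie jar:

user@devhost:~$ curl -b cookies.txt -d 'username=jon' http://blerg.example.com/feed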

NOTE: subscription notifications are only stored while subscriptions are active. Any records inserted before or after a subscription is active will not show up in /feed.

/feedinfo, /feedinfo/(user) - Get subscription status for a user

POST to /feedinfo with a username parameter and an auth cookie to get general information about your subscribed feeds. Currently, this only tells you how many new records there are since the last time /feed was fetched. The server will respond with a JSON object:

{"new":3}

POST to /feedinfo/(user) with a username parameter and an auth cookie, where (user) is a user whose subscription status you are interested in. The server will respond with a simple JSON object:

{"subscribed":true}

The value of "subscribed" will be either true or false depending on +the subscription status. + +

/passwd - Change a user's password

POST to /passwd with a username parameter and an auth cookie, plus password and new_password parameters to change the user's password. For extra protection, changing a password requires sending the user's current password in the password parameter. If authentication is successful and the password matches, the user's password is set to new_password and the server responds with JSON success. If the password doesn't match, or one of password or new_password is missing, the server returns JSON failure.
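For example (all values are placeholders):

user@devhost:~$ curl -b cookies.txt -d 'username=jon' -d 'password=secret1' -d 'new_password=secret2' http://blerg.example.com/passwd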

Libraries

C

Most of Blërg's core functionality is packaged in a static library called blerg.a. It's not designed to be public or installed with `make install-environment`, but it should be relatively straightforward to use it in C programs. Look at the headers under the database directory.

A secondary library called blerg_auth.a handles the authentication layer of Blërg. To use it, look at common/auth.h.

Perl

As of 1.9.0, Blërg includes a Perl library called Blerg::Database. It wraps the core and authentication functionality in a perlish interface. The module has its own POD documentation, which you can read with your favorite POD reader, from the manual installed in an environment directory, or in HTML.

Design

Motivation

Blërg was created as the result of a thought experiment: "What if Twitter didn't need thousands of servers? What if its millions of users could be handled by a single highly efficient server?" This is probably an unreachable goal due to the sheer amount of I/O, but we can certainly try to do better. Blërg was thus designed as a system with very simple requirements:

  1. Store and fetch small chunks of text efficiently
  2. …

… search, or more complicated tag searches. Blërg only does the basics.

Modern web applications have at least a four-layer approach. You have the client-side browser app, the web server, the server-side application, and the database. Your data goes through a lot of layers before it actually resides on disk somewhere (or, as they're calling it these days, "The Cloud" *waves hands*). Each of those layers requires some amount of computing resources, so to increase throughput, we must make the layers more efficient, or reduce the number of layers.
    Blërg model
    Blërg Client App: HTML/Javascript
    Blërg Database: Fuckin' hardcore C and shit


Blërg does both by smashing the last two or three layers into one application. Blërg can be run as either a standalone web server (currently deprecated because maintaining two versions is hard), or as a CGI (FastCGI support is planned, but I just don't care right now). Less waste, more throughput. As a consequence of this, the entirety of the application logic that the user sees is implemented in the client app in Javascript. That's why all the URLs have #'s — the page is loaded once and switched on the fly to show different views, further reducing load on the server. Even parsing hash tags and URLs is done in client JS.

The API is simple and pragmatic. It's not entirely RESTful, but is rather designed to work well with web-based front-ends. Client data is … until after I wrote Blërg. :)

Database


I was impressed by varnish's design, so I decided early in the design process that I'd try out mmapped I/O. Each user in Blërg has their own database, which consists of a metadata file, and one or more data and index files. The data and index files are memory mapped, which hopefully makes things more efficient by letting the OS handle when to read from disk (or maybe not — I haven't benchmarked it). The index files are preallocated because I believe it's more efficient than writing to them 40 bytes at a time as records are added. The database's limits are reasonable:
    maximum record size: 65535 bytes
    maximum number of records per database: 2^64 - 1
    maximum number of tags per record: 1024


So as not to create grossly huge and unwieldy data files, the database layer splits data and index files into many "segments" containing at most 64K entries each. Those of you doing some quick mental math may note that this could cause a problem on 32-bit machines — if a full segment contains entries of the maximum length, you'll have to mmap 4GB (32-bit Linux gives each process only 3GB of virtual address space). Right now, 32-bit users should change RECORDS_PER_SEGMENT in config.h to something lower like 32768. In the future, I might do something smart like not mmaping the whole fracking file.
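For a 32-bit build, that is a one-line change; assuming config.h uses a plain #define, something like:

user@devhost:~/blerg$ sed -i 's/#define RECORDS_PER_SEGMENT.*/#define RECORDS_PER_SEGMENT 32768/' config.h   # exact #define format assumed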

    Record Index Structure
    offset (32-bit integer)
    length (16-bit integer)
    flags (16-bit integer)
    timestamp (32-bit integer)

A record is stored by first appending the data to the data file, then writing an entry in the index file containing the offset and length of the data, as well as the timestamp. Since each index entry is fixed length, we can find the index entry simply by multiplying the record number we want by the size of the index entry. Upshot: constant-time random-access reads and constant-time writes. As an added bonus, because we're using append-only files, we get lockless reads.
    Tag Structure
    username (32 bytes)
    record number (64-bit integer)

Tags are handled by a separate set of indices, one per tag. When a record is added, it is scanned for tags, then entries are appended to each tag index for the tags found. Each index record simply stores the user and record number. Tags are searched by opening the tag file, reading the last 50 entries or so, and then reading all the records listed. Voila, fast tag lookups.

At this point, you're probably thinking, "Is that it?" Yep, that's it. Blërg isn't revolutionary, it's just a system whose requirements … disk before returning success. This should make Blërg extremely fast, and totally unreliable in a crash. But that's the way you want it, right? :]


Subscriptions

When I first started thinking about the idea of subscriptions, I immediately came up with the naïve solution: keep a list of users to which users are subscribed, then when you want to get updates, iterate over the list and find the last entries for each user. And that would work, but it's kind of costly in terms of disk I/O. I have to visit each user in the list, retrieve their last few entries, and store them somewhere else to be sorted later. And worse, that computation has to be done every time a user checks their feed. As the number of users and subscriptions grows, that will become a problem.

So instead, I thought about it the other way around. Instead of doing all the work when the request is received, Blërg tries to do as much as possible by "pushing" updates to subscribed users. You can think of it kind of like a mail system. When a user posts new content, a notification is "sent" out to each of that user's subscribers. Later, when the subscribers want to see what's new, they simply check their mailbox. Checking your mailbox is usually a lot more efficient than going around and checking everyone's records yourself, even with the overhead of the "mailman."

    The "mailbox" is a subscription index, which is identical to a tag +index, but is a per-user construct. When a user posts a new record, a +subscription index record is written for every subscriber. It's a +similar amount of I/O as the naïve version above, but the important +difference is that it's only done once. Retrieving records for accounts +you're subscribed to is then as simple as reading your subscription +index and reading the associated records. This is hopefully less I/O +than the naïve version, since you're reading, at most, as many accounts +as you have records in the last N entries of your subscription index, +instead of all of them. And as an added bonus, since subscription index +records are added as posts are created, the subscription index is +automatically sorted by time! To support this "mail" architecture, we +also keep a list of subscribers and subscrib...ees in each account. + +

Problems, Caveats, and Future Work
Blërg probably doesn't actually work like Twitter because I've never actually had a Twitter account.

Libmicrohttpd is small, but it's focused on embedded applications, so it often eschews speed for small memory footprint. This is especially apparent when you watch it chew through a POST request 300 bytes at a time even though you've specified a buffer size of 256K. blerg.httpd is still pretty fast this way — on my 2GHz Opteron 246, siege says it serves a 690-byte /get request at about 945 transactions per second, average response time 0.05 seconds, with 100 concurrent accesses — but a fast HTTP server implementation could knock this out of the park.

Libmicrohttpd is also really difficult to work with. If you look at the code, http_blerg.c is about 70% longer than cgi_blerg.c simply because of all the iterator hoops I had to jump through to process POST requests. And if you can believe it, I wrote http_blerg.c first. If I'd done it the other way around, I probably would have given up on libmicrohttpd. :-/

    The data structures written to disk are dependent on the size and endianness of the primitive data types on your architecture and OS.