  • Installing
  • API
  • Design

    Getting the source

There's no stable release yet, but you can get everything currently running on blerg.dominionofawesome.com by cloning the git repository at http://git.bytex64.net/blerg.git.

    Configuring

There is now an experimental autoconf build system. If you run add-autoconf, it'll do the magic and create a configure script that'll do the familiar things. If I ever get around to distributing source packages, you should find that this has already been done.

If you'd rather stick with the manual system, you should edit libs.mk and put in the paths where you can find headers and libraries for the above requirements.

Also, further apologies to BSD folks — I've probably committed several unconscious Linux-isms. It would not surprise me if the makefile refuses to work with BSD make, or if it fails to compile even with gmake. If you have patches or suggestions on how to make Blërg more portable, I'd be happy to hear them.

    Building

At this point, it should be gravy. Type 'make' and in a few seconds, you should have blerg.httpd, blerg.cgi, rss.cgi, and blergtool. Each of those can be made individually as well, if you, for example, don't want to install the prerequisites for blerg.httpd or blerg.cgi.

NOTE: blerg.httpd is deprecated and will not be updated with new features.

    Installing

While it's not strictly required, Blërg will be easier to set up if you configure it to work from the root of your website. For this reason, it's better to use a subdomain (i.e., blerg.yoursite.com is easier than yoursite.com/blerg/). If you do want to put it in a subdirectory, you will have to modify www/js/blerg.js and change baseURL at the top, as well as a number of other self-references in that file and www/index.html. The CGI version should work fine this way, but the HTTP version will require the request to be rewritten, as it expects to be serving from the root.

You cannot serve the database and client from different domains (i.e., yoursite.com vs. othersite.net, or even foo.yoursite.com and bar.yoursite.com). This is a requirement of the web browser — the same-origin policy will not allow an AJAX request to travel across domains.

    For the standalone web server:

Right now, blerg.httpd doesn't serve any static assets, so you're going to have to put it behind a real webserver like Apache, lighttpd, nginx, or similar. Set the document root to the www directory, then proxy /info, /create, /login, /logout, /get, /tag, and /put to blerg.httpd. You can change the port blerg.httpd listens on in config.h.

    For the CGI version:

Copy the files in www/ to the root of your web server. Copy blerg.cgi to your web server. Included in www-configs/ is a .htaccess file for Apache that will rewrite the URLs. If you need to call the CGI something other than blerg.cgi, the .htaccess file will need to be modified.

    The extra RSS CGI

There is an optional RSS CGI (rss.cgi) that will serve RSS feeds for users. Install this like blerg.cgi above.

    API

Blërg's API was designed to be as simple as possible. Data sent from the client is POSTed with the application/x-www-form-urlencoded encoding, and a successful response is always JSON. The API endpoints will be described as though the server were serving requests from the root of the website.

    API Definitions

On failure, all API calls return either a standard HTTP error response, like 404 Not Found if a record or user doesn't exist, or a 200 response with a 'JSON failure', which will look like this:

    {"status": "failure"} + +

Blërg doesn't currently explain why there is a failure, and I'm not sure it ever will.

On success, you'll either get some JSON relating to your request (for /get, /tag, or /info), or a 'JSON success' response (for /create, /put, /login, or /logout), which looks like this:

    {"status": "success"} + +

For the CGI backend, you may get a 500 error if something goes wrong. For the HTTP backend, you'll get nothing (since it will have crashed), or maybe a 502 Bad Gateway if you have it behind another web server.

All usernames must be 32 characters or less. Usernames must contain only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-). Passwords can be at most 64 bytes, and have no limits on characters (but beware: if you have a null in the middle, it will stop checking there because I use strncmp(3) to compare).

Tags must be 64 characters or less, and can contain only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-).
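For illustration, a check equivalent to those rules might look like this C sketch (illustrative only, not the actual Blërg source):

#include <string.h>

/* Usernames are at most 32 characters, tags at most 64, both drawn
 * only from 0-9, A-Z, a-z, underscore, and hyphen. */
static int valid_name(const char *s, size_t maxlen) {
    size_t len = strlen(s);
    if (len > maxlen)
        return 0;
    for (size_t i = 0; i < len; i++) {
        char c = s[i];
        if (!((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
              (c >= 'a' && c <= 'z') || c == '_' || c == '-'))
            return 0;
    }
    return 1;
}

Call it as valid_name(username, 32) for usernames and valid_name(tag, 64) for tags.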

    /create - create a new user

To create a user, POST to /create with username and password parameters for the new user. The server will respond with JSON failure if the user exists, or if the user can't be created for some other reason. The server will respond with JSON success if the user is created.
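For example, a minimal libcurl client for /create might look like this (blerg.example.com and the credentials are placeholders, and error handling is trimmed for brevity):

#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* POST the two parameters, form-urlencoded, as the API expects.
     * libcurl writes the JSON response to stdout by default. */
    curl_easy_setopt(curl, CURLOPT_URL, "http://blerg.example.com/create");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "username=jon&password=hunter2");
    if (curl_easy_perform(curl) != CURLE_OK)
        fprintf(stderr, "request failed\n");

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}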

    /login - log in

POST to /login with the username and password parameters for an existing user. The server will respond with JSON failure if the user does not exist or if the password is incorrect. On success, the server will respond with JSON success, and will set a cookie named 'auth' that must be sent by the client when accessing restricted API functions (/put and /logout).

    /logout - log out

POST to /logout with username, the user to log out, along with the auth cookie in a Cookie header. The server will respond with JSON failure if the user does not exist or if the auth cookie is bad. The server will respond with JSON success after the user is successfully logged out.

    /put - add a new record

POST to /put with username and data parameters, and an auth cookie. The server will respond with JSON failure if the auth cookie is bad, if the user doesn't exist, or if data contains more than 65535 bytes after URL decoding. The server will respond with JSON success after the record is successfully added.

    /get/(user), /get/(user)/(start record)-(end record) - get records for a user

A GET request to /get/(user), where (user) is the user desired, will return the last 50 records for that user in a list of objects. The record objects look like this:

{
  "record":"0",
  "timestamp":1294309438,
  "data":"eatin a taco on fifth street"
}
    + +

record is the record number, timestamp is the UNIX epoch timestamp (i.e., the number of seconds since Jan 1 1970 00:00:00 GMT), and data is the content of the record. The record number is sent as a string because while Blërg supports record numbers up to 2^64 - 1, Javascript uses floating point for all its numbers, and can only support integers without truncation up to 2^53. This difference is largely academic, but I didn't want this problem to sneak up on anyone who is more insane than I am. :]

The second form, /get/(user)/(start record)-(end record), retrieves a specific range of records, from (start record) to (end record) inclusive. You can retrieve at most 100 records this way. If (end record) - (start record) specifies more than 100 records, or if the range specifies invalid records, or if the end record is before the start record, the server will respond with JSON failure.
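As a sketch, fetching a range is a plain GET (this fragment assumes a CURL handle initialized as in the /create example above):

/* Fetch records 0-49 for user "jon"; the JSON list goes to stdout. */
curl_easy_setopt(curl, CURLOPT_HTTPGET, 1L);
curl_easy_setopt(curl, CURLOPT_URL, "http://blerg.example.com/get/jon/0-49");
curl_easy_perform(curl);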

    /info/(user) - Get information about a user

A GET request to /info/(user) will return a JSON object with information about the user (currently only the number of records). The info object looks like this:

{
  "record_count": "544"
}
    + +

Again, the record count is sent as a string for 64-bit safety.

    /tag/(#|H|@)(tagname) - Retrieve records containing tags

A GET request to this endpoint will return the last 50 records associated with the given tag. The first character is either # or H for hashtags, or @ for mentions (I call them ref tags). You should URL encode the # or @, lest some servers complain at you. The H alias for # was created because Apache helpfully strips the fragment of a URL (everything from the # to the end) before handing it off to the CGI, even if the hash is URL encoded. The record objects also contain an extra author field, like so:

{
  "author":"Jon",
  "record":"57",
  "timestamp":1294555793,
  "data":"I'm taking #garfield to the vet."
}
    + +

There is currently no support for getting more than 50 tags, but /tag will probably mutate to work like /get.
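Since the # has to be URL-encoded, a tag fetch looks like this sketch (again assuming an initialized handle and a placeholder host):

/* "%23garfield" is the URL-encoded form of "#garfield"; using the H
 * alias ("/tag/Hgarfield") avoids the encoding issue entirely. */
curl_easy_setopt(curl, CURLOPT_HTTPGET, 1L);
curl_easy_setopt(curl, CURLOPT_URL, "http://blerg.example.com/tag/%23garfield");
curl_easy_perform(curl);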

/subscribe/(user) - Subscribe to a user's updates

POST to /subscribe/(user) with a username parameter and an auth cookie, where (user) is the user whose updates you wish to subscribe to. The server will respond with JSON failure if the auth cookie is bad or if the user doesn't exist. The server will respond with JSON success after the subscription is successfully registered.

/unsubscribe/(user) - Unsubscribe from a user's updates

Identical to /subscribe, but removes the subscription.

    /feed - Get updates for subscribed users

POST to /feed, with a username parameter and an auth cookie. The server will respond with a JSON list of the last 50 updates from all subscribed users, in reverse chronological order. Fetching /feed resets the new message count returned from /feedinfo.

NOTE: subscription notifications are only stored while subscriptions are active. Any records inserted before or after a subscription is active will not show up in /feed.

/feedinfo, /feedinfo/(user) - Get subscription status for a user

POST to /feedinfo with a username parameter and an auth cookie to get general information about your subscribed feeds. Currently, this only tells you how many new records there are since the last time /feed was fetched. The server will respond with a JSON object:

    +{"new":3}
    +
    + +

POST to /feedinfo/(user) with a username parameter and an auth cookie, where (user) is a user whose subscription status you are interested in. The server will respond with a simple JSON object:

    +{"subscribed":true}
    +
    + +

The value of "subscribed" will be either true or false depending on the subscription status.

    /passwd - Change a user's password

POST to /passwd with a username parameter and an auth cookie, plus password and new_password parameters to change the user's password. For extra protection, changing a password requires sending the user's current password in the password parameter. If authentication is successful and the password matches, the user's password is set to new_password and the server responds with JSON success.

If the password doesn't match, or one of password or new_password is missing, the server returns JSON failure.

    Design


Blërg was created as the result of a thought experiment: "What if Twitter didn't need thousands of servers? What if its millions of users could be handled by a single highly efficient server?" This is probably an unreachable goal due to the sheer amount of I/O, but we can certainly try to do better. Blërg was thus designed as a system with very simple requirements:

    1. Store and fetch small chunks of text efficiently
2. [...]

[...] search, or more complicated tag searches. Blërg only does the basics.

Modern web applications have at least a four-layer approach. You have the client-side browser app, the web server, the server-side application, and the database. Your data goes through a lot of layers before it actually resides on disk somewhere (or, as they're calling it these days, "The Cloud" *waves hands*). Each of those layers requires some amount of computing resources, so to increase throughput, we must make the layers more efficient, or reduce the number of layers.
Blërg model:
  Blërg Client App: HTML/Javascript
  Blërg Database: Fuckin' hardcore C and shit

Blërg does both by smashing the last two or three layers into one application. Blërg can be run as either a standalone web server (currently deprecated because maintaining two versions is hard), or as a CGI (FastCGI support is planned, but I just don't care right now). Less waste, more throughput. As a consequence of this, the entirety of the application logic that the user sees is implemented in the client app in Javascript. That's why all the URLs have #'s — the page is loaded once and switched on the fly to show different views, further reducing load on the server. Even parsing hash tags and URLs is done in client JS.

The API is simple and pragmatic. It's not entirely RESTful, but is rather designed to work well with web-based front-ends. Client data is [...] until after I wrote Blërg. :)

      Database

I was impressed by varnish's design, so I decided early in the design process that I'd try out mmaped I/O. Each user in Blërg has their own database, which consists of a metadata file, and one or more data and index files. The data and index files are memory mapped, which hopefully makes things more efficient by letting the OS handle when to read from disk (or maybe not — I haven't benchmarked it). The index files are preallocated because I believe it's more efficient than writing to it 40 bytes at a time as records are added. The database's limits are reasonable:
maximum record size: 65535 bytes
maximum number of records per database: 2^64 - 1
maximum number of tags per record: 1024
So as not to create grossly huge and unwieldy data files, the database layer splits data and index files into many "segments" containing at most 64K entries each. Those of you doing some quick mental math may note that this could cause a problem on 32-bit machines — if a full segment contains entries of the maximum length, you'll have to mmap 4GB (32-bit Linux gives each process only 3GB of virtual address space). Right now, 32-bit users should change RECORDS_PER_SEGMENT in config.h to something lower like 32768. In the future, I might do something smart like not mmaping the whole fracking file.
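The segment arithmetic is straightforward; here's a sketch, assuming the default of 64K records per segment:

#include <stdint.h>

#define RECORDS_PER_SEGMENT 65536ULL  /* default; see config.h */

uint64_t segment_number(uint64_t record)    { return record / RECORDS_PER_SEGMENT; }
uint64_t record_in_segment(uint64_t record) { return record % RECORDS_PER_SEGMENT; }

/* Worst case per segment: 65536 records of 65535 bytes each is just
 * under 4GB of data file to mmap, hence the 32-bit advice above. */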

Record Index Structure:
  offset (32-bit integer)
  length (16-bit integer)
  flags (16-bit integer)
  timestamp (32-bit integer)

A record is stored by first appending the data to the data file, then writing an entry in the index file containing the offset and length of the data, as well as the timestamp. Since each index entry is fixed length, we can find the index entry simply by multiplying the record number we want by the size of the index entry. Upshot: constant-time random-access reads and constant-time writes. As an added bonus, because we're using append-only files, we get lockless reads.
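A sketch of that lookup, using the field layout from the table above (the actual struct in the Blërg source may differ):

#include <stddef.h>
#include <stdint.h>

struct index_entry {
    uint32_t offset;     /* where the record's data starts in the data file */
    uint16_t length;     /* how many bytes of data */
    uint16_t flags;
    uint32_t timestamp;  /* UNIX epoch seconds */
};

/* Byte position of record n's entry in its segment's index file:
 * pure arithmetic, so reads are constant-time. */
size_t index_entry_position(uint64_t n) {
    return (size_t)(n % 65536) * sizeof(struct index_entry);
}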
Tag Structure:
  username (32 bytes)
  record number (64-bit integer)

Tags are handled by a separate set of indices, one per tag. When a record is added, it is scanned for tags, then entries are appended to each tag index for the tags found. Each index record simply stores the user and record number. Tags are searched by opening the tag file, reading the last 50 entries or so, and then reading all the records listed. Voila, fast tag lookups.

At this point, you're probably thinking, "Is that it?" Yep, that's it. Blërg isn't revolutionary, it's just a system whose requirements [...] disk before returning success. This should make Blërg extremely fast, and totally unreliable in a crash. But that's the way you want it, right? :]


      Subscriptions

When I first started thinking about the idea of subscriptions, I immediately came up with the naïve solution: keep a list of users to which users are subscribed, then when you want to get updates, iterate over the list and find the last entries for each user. And that would work, but it's kind of costly in terms of disk I/O. I have to visit each user in the list, retrieve their last few entries, and store them somewhere else to be sorted later. And worse, that computation has to be done every time a user checks their feed. As the number of users and subscriptions grows, that will become a problem.

So instead, I thought about it the other way around. Instead of doing all the work when the request is received, Blërg tries to do as much as possible by "pushing" updates to subscribed users. You can think of it kind of like a mail system. When a user posts new content, a notification is "sent" out to each of that user's subscribers. Later, when the subscribers want to see what's new, they simply check their mailbox. Checking your mailbox is usually a lot more efficient than going around and checking everyone's records yourself, even with the overhead of the "mailman."

      The "mailbox" is a subscription index, which is identical to a tag +index, but is a per-user construct. When a user posts a new record, a +subscription index record is written for every subscriber. It's a +similar amount of I/O as the naïve version above, but the important +difference is that it's only done once. Retrieving records for accounts +you're subscribed to is then as simple as reading your subscription +index and reading the associated records. This is hopefully less I/O +than the naïve version, since you're reading, at most, as many accounts +as you have records in the last N entries of your subscription index, +instead of all of them. And as an added bonus, since subscription index +records are added as posts are created, the subscription index is +automatically sorted by time! To support this "mail" architecture, we +also keep a list of subscribers and subscrib...ees in each account. + +

      Problems, Caveats, and Future Work

Blërg probably doesn't actually work like Twitter because I've never actually had a Twitter account.

Libmicrohttpd is small, but it's focused on embedded applications, so it often eschews speed for small memory footprint. This is especially apparent when you watch it chew through a POST request 300 bytes at a time even though you've specified a buffer size of 256K. blerg.httpd is still pretty fast this way — on my 2GHz Opteron 246, siege says it serves a 690-byte /get request at about 945 transactions per second, average response time 0.05 seconds, with 100 concurrent accesses — but a fast HTTP server implementation could knock this out of the park.

Libmicrohttpd is also really difficult to work with. If you look at the code, http_blerg.c is about 70% longer than cgi_blerg.c simply because of all the iterator hoops I had to jump through to process POST requests. And if you can believe it, I wrote http_blerg.c first. If I'd done it the other way around, I probably would have given up on libmicrohttpd. :-/

      The data structures written to disk are dependent on the size and endianness of the primitive data types on your architecture and OS.