www/doc/index.html

   1 <!DOCTYPE html>
   2 <html>
   3 <head>
   4 <title>Blërg Documentation</title>
   5 <link rel="stylesheet" href="/css/doc.css">
   6 </head>
   7 <body>
   8
   9 <h1>Blërg</h1>
  10
  11 Blërg is a minimalistic tagged text document database engine that also
  12 pretends to be a <a href="/">microblogging system</a>.  It is designed
  13 to efficiently store small (&lt; 64K) pieces of text in a way that they
  14 can be quickly retrieved by record number or by querying for tags
  15 embedded in the text.  Its native interface is HTTP &mdash; Blërg comes
  16 as either a standalone HTTP server, or a CGI.  Blërg is written in pure
  17 C.
  18
  19 <ul class="toc">
  20   <li><a href="#installing">Installing</a>
  21     <ul>
  22       <li><a href="#getting_the_source">Getting the source</a></li>
  23       <li><a href="#requirements">Requirements</a></li>
  24       <li><a href="#configuring">Configuring</a></li>
  25       <li><a href="#building">Building</a></li>
  26       <li><a href="#installing">Installing</a></li>
  27     </ul>
  28   </li>
  29   <li><a href="#api">API</a>
  30     <ul>
  31       <li><a href="#api_definitions">API Definitions</a></li>
  32       <li><a href="#api_create">/create - create a new user</a></li>
  33       <li><a href="#api_login">/login - log in</a></li>
  34       <li><a href="#api_logout">/logout - log out</a></li>
  35       <li><a href="#api_put">/put - add a new record</a></li>
  36       <li><a href="#api_get">/get/(user), /get/(user)/(start record)-(end record) - get records for a user</a></li>
  37       <li><a href="#api_info">/info/(user) - Get information about a user</a></li>
  38       <li><a href="#api_tag">/tag/(#|H|@)(tagname) - Retrieve records containing tags</a></li>
  39     </ul>
  40   </li>
  41   <li><a href="#design">Design</a>
  42     <ul>
  43       <li><a href="#motivation">Motivation</a></li>
  44       <li><a href="#web_app_stack">Web App Stack</a></li>
  45       <li><a href="#database">Database</a></li>
  46       <li><a href="#problems">Problems and Future Work</a></li>
  47     </ul>
  48   </li>
  49 </ul>
  50
  51 <h2><a name="installing">Installing</a></h2>
  52
  53 <h3><a name="getting_the_source">Getting the source</a></h3>
  54
  55 <p>There's no stable release yet, but you can get everything currently
  56 running on blerg.dominionofawesome.com by cloning the git repository at
  57 http://git.bytex64.net/blerg.git.
  58
  59 <h3><a name="requirements">Requirements</a></h3>
  60
  61 <p>Blërg has varying requirements depending on how you want to run it
  62 &mdash; as a standalone HTTP server, or as a CGI.  You will need:
  63
  64 <ul>
  65 <li><a href="http://lloyd.github.com/yajl/">yajl</a> &gt;= 1.0.0
  66 (yajl is a JSON parser/generator written in C which, by some twisted
  67 sense of humor, requires ruby to compile)</li>
  68 </ul>
  69
  70 <p>As a standalone HTTP, server, you will also need:
  71
  72 <ul>
  73 <li><a href="http://www.gnu.org/software/libmicrohttpd/">GNU libmicrohttpd</a> &gt;= 0.9.3</li>
  74 </ul>
  75
  76 <p>Or, as a CGI, you will need:
  77
  78 <ul>
  79 <li><a href="http://www.newbreedsoftware.com/cgi-util/download/">cgi-util</a> &gt;= 2.2.1</li>
  80 </ul>
  81
  82 <h3><a name="configuring">Configuring</a></h3>
  83
  84 <p>I know I'm gonna get shit for not using an autoconf-based system, but
  85 I really didn't want to spend time figuring it out.  You should edit
  86 libs.mk and put in the paths where you can find headers and libraries
  87 for the above requirements.
  88
  89 <p>Also, further apologies to BSD folks &mdash; I've probably committed
  90 several unconscious Linux-isms.  It would not surprise me if the
  91 makefile refuses to work with BSD make, or if it fails to compile even
  92 with gmake.  If you have patches or suggestions on how to make Blërg
  93 more portable, I'd be happy to hear them.
  94
  95 <h3><a name="building">Building</a></h3>
  96
  97 <p>At this point, it should be gravy.  Type 'make' and in a few seconds,
  98 you should have <code>blerg.httpd</code>, <code>blerg.cgi</code>,
  99 <code>rss.cgi</code>, and <code>blergtool</code>.  Each of those can be
 100 made individually as well, if you, for example, don't want to install
 101 the prerequisites for <code>blerg.httpd</code> or
 102 <code>blerg.cgi</code>.
 103
 104 <h3><a name="installing">Installing</a></h3>
 105
 106 <p>While it's not strictly required, Blërg will be easier to set up if
 107 you configure it to work from the root of your website.  For this
 108 reason, it's better to use a subdomain (i.e., blerg.yoursite.com is
 109 easier than yoursite.com/blerg/).  If you do want to put it in a
 110 subdirectory, you will have to modify <code>www/js/blerg.js</code> and
 111 change baseURL at the top as well as a number of other self-references
 112 in that file and <code>www/index.html</code>.  The CGI version should
 113 work fine this way, but the HTTP version will require the request to be
 114 rewritten, as it expects to be serving from the root.
 115
 116 <p>You cannot serve the database and client from different domains
 117 (i.e., yoursite.com vs othersite.net, or even foo.yoursite.com and
 118 bar.yoursite.com).  This is a requirement of the web browser &mdash; the
 119 same origin policy will not allow an AJAX request to travel across
 120 domains.
 121
 122 <h4>For the standalone web server:</h4>
 123
 124 <p>Right now, <code>blerg.httpd</code> doesn't serve any static assets,
 125 so you're going to have to put it behind a real webserver like apache,
 126 lighttpd, nginx, or similar.  Set the document root to the www
 127 directory, then proxy /info, /create, /login, /logout, /get, /tag, and
 128 /put to blerg.httpd.  You can change the port <code>blerg.httpd</code>
 129 listens on in <code>config.h</code>.
 130
 131 <h4>For the CGI version:</h4>
 132
 133 <p>Copy the files in www/ to the root of your web server.  Copy
 134 <code>blerg.cgi</code> to your web server.  Included in www-configs/ is
 135 a .htaccess file for Apache that will rewrite the URLs.  If you need to
 136 call the CGI something other than <code>blerg.cgi</code>, the .htaccess
 137 file will need to be modified.
 138
 139 <h4>The extra RSS CGI</h4>
 140
 141 <p>There is an optional RSS cgi (<code>rss.cgi</code>) that will serve
 142 RSS feeds for users.  Install this like <code>blerg.cgi</code> above.
 143
 144
 145 <h2><a name="api">API</a></h2>
 146
 147 <p>Blërg's API was designed to be as simple as possible.  Data sent from
 148 the client is POSTed with the application/x-www-form-urlencoded
 149 encoding, and a successful response is always JSON.  The API endpoints
 150 will be described as though the server were serving requests from the
 151 root of the wesite.
 152
 153 <h3><a name="api_definitions">API Definitions</a></h3>
 154
 155 <p>On failure, all API calls return either a standard HTTP error
 156 response, like 404 Not Found if a record or user doesn't exist, or a 200
 157 response with a 'JSON failure', which will look like this:
 158
 159 <p><code>{"status": "failure"}</code>
 160
 161 <p>Blërg doesn't currently explain <i>why</i> there is a failure, and
 162 I'm not sure it ever will.
 163
 164 <p>On success, you'll either get some JSON relating to your request (for
 165 /get, /tag, or /info), or a 'JSON success' response (for /create, /put,
 166 /login, or /logout), which looks like this:
 167
 168 <p><code>{"status": "success"}</code>
 169
 170 <p>For the CGI backend, you may get a 500 error if something goes wrong.
 171 For the HTTP backend, you'll get nothing (since it will have crashed),
 172 or maybe a 502 Bad Gateway if you have it behind another web server.
 173
 174 <p>All usernames must be 32 characters or less.  Usernames must contain
 175 only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-).
 176 Passwords can be at most 64 bytes, and have no limits on characters (but
 177 beware: if you have a null in the middle, it will stop checking there
 178 because I use <code>strncmp(3)</code> to compare).
 179
 180 <p>Tags must be 64 characters or less, and can contain only the ASCII
 181 characters 0-9, A-Z, a-z, underscore (_), and hyphen (-).
 182
 183 <h3><a name="api_create">/create</a> - create a new user</a></h3>
 184
 185 <p>To create a user, POST to /create with <code>username</code> and
 186 <code>password</code> parameters for the new user.  The server will
 187 respond with JSON failure if the user exists, or if the user can't be
 188 created for some other reason.  The server will respond with JSON
 189 success if the user is created.
 190
 191 <h3><a name="api_login">/login</a> - log in</a></h3>
 192
 193 <p>POST to /login with the <code>username</code> and
 194 <code>password</code> parameters for an existing user.  The server will
 195 respond with JSON failure if the user does not exist or if the password
 196 is incorrect.  On success, the server will respond with JSON success,
 197 and will set a cookie named 'auth' that must be sent by the client when
 198 accessing restricted API functions (/put and /logout).
 199
 200 <h3><a name="api_logout">/logout</a> - log out</a></h3>
 201
 202 <p>POST to /logout with with <code>username</code>, the user to log out,
 203 along with the auth cookie in a Cookie header.  The server will respond
 204 with JSON failure if the user does not exist or if the auth cookie is
 205 bad.  The server will respond with JSON success after the user is
 206 successfully logged out.
 207
 208 <h3><a name="api_put">/put</a> - add a new record</a></h3>
 209
 210 <p>POST to /put with <code>username</code> and <code>data</code>
 211 parameters, and an auth cookie.  The server will respond with JSON
 212 failure if the auth cookie is bad, if the user doesn't exist, or if
 213 <code>data</code> contains more than 65535 bytes <i>after</i> URL
 214 decoding.  The server will respond with JSON success after the record is
 215 successfully added.
 216
 217 <h3><a name="api_get">/get/(user), /get/(user)/(start record)-(end record)</a> - get records for a user</a></h3>
 218
 219 <p>A GET request to /get/(user), where (user) is the user desired, will
 220 return the last 50 records for that user in a list of objects.  The
 221 record objects look like this:
 222
 223 <pre>
 224 {
 225   "record":"0",
 226   "timestamp":1294309438,
 227   "data":"eatin a taco on fifth street"
 228 }
 229 </pre>
 230
 231 <p><code>record</code> is the record number, <code>timestamp</code> is
 232 the UNIX epoch timestamp (i.e., the number of seconds since Jan 1 1970
 233 00:00:00 GMT), and <code>data</code> is the content of the record.  The
 234 record number is sent as a string because while Blërg supports record
 235 numbers up to 2<sup>64</sup> - 1, Javascript uses floating point for all
 236 its numbers, and can only support integers without truncation up to
 237 2<sup>53</sup>.  This difference is largely academic, but I didn't want
 238 this problem to sneak up on anyone who is more insane than I am. :]
 239
 240 <p>The second form, /get/(user)/(start record)-(end record), retrieves a
 241 specific range of records, from (start record) to (end record)
 242 inclusive.  You can retrieve at most 100 records this way.  If (end
 243 record) - (start record) specifies more than 100 records, or if the
 244 range specifies invalid records, or if the end record is before the
 245 start record, the server will respond with JSON failure.
 246
 247 <h3><a name="api_info">/info/(user)</a> - Get information about a user</a></h3>
 248
 249 <p>A GET request to /info/(user) will return a JSON object with
 250 information about the user (currently only the number of records).  The
 251 info object looks like this:
 252
 253 <pre>
 254 {
 255   "record_count": "544"
 256 }
 257 </pre>
 258
 259 <p>Again, the record count is sent as a string for 64-bit safety.
 260
 261 <h3><a name="api_tag">/tag/(#|H|@)(tagname)</a> - Retrieve records containing tags</a></h3>
 262
 263 <p>A GET request to this endpoint will return the last 50 records
 264 associated with the given tag.  The first character is either # or H for
 265 hashtags, or @ for mentions (I call them ref tags).  You should URL
 266 encode the # or @, lest some servers complain at you.  The H alias for #
 267 was created because Apache helpfully strips the fragment of a URL
 268 (everything from the # to the end) before handing it off to the CGI,
 269 even if the hash is URL encoded.  The record objects also contain an
 270 extra <code>author</code> field, like so:
 271
 272 <pre>
 273 {
 274   "author":"Jon",
 275   "record":"57",
 276   "timestamp":1294555793,
 277   "data":"I'm taking #garfield to the vet."
 278 }
 279 </pre>
 280
 281 <p>There is currently no support for getting more than 50 tags, but /tag
 282 will probably mutate to work like /get.
 283
 284 <h2><a name="design">Design</a></h2>
 285
 286 <h3><a name="motivation">Motivation</a></h3>
 287
 288 <p>Blërg was created as the result of a thought experiment: "What if
 289 Twitter didn't need thousands of servers? What if its millions of users
 290 could be handled by a single highly efficient server?"  This is probably
 291 an unreachable goal due to the sheer amount of I/O, but we can certainly
 292 try to do better.  Blërg was thus designed as a system with very simple
 293 requirements:
 294
 295 <ol>
 296 <li>Store and fetch small chunks of text efficiently</li>
 297 <li>Create fast indexes for hash tags and @ mentions</li>
 298 <li>Provide a HTTP interface web apps can use</li>
 299 </ol>
 300
 301 <p>And to further simplify, I didn't bother handling deletes, full text
 302 search, or more complicated tag searches.  Blërg only does the basics.
 303
 304 <h3><a name="web_app_stack">Web App Stack</a></h3>
 305
 306 <table class="pizzapie">
 307 <tr><th>Classical model</th></tr>
 308 <tr>
 309   <td style="background-color: blue; color: white"><b>Client App</b><br>HTML/Javascript</td>
 310 </tr>
 311 <tr>
 312   <td style="background-color: #9F0000; color: white"><b>Webserver</b><br>Apache, lighttpd, nginx, etc.</td>
 313 </tr>
 314 <tr>
 315   <td style="background-color: #009F00; color: white"><b>Server App</b><br>Python, Perl, Ruby, etc.</td>
 316 </tr>
 317 <tr>
 318   <td style="background-color: #404040; color: white"><b>Database</b><br>MySQL, PostgreSQL, MongoDB, CouchDB, etc.</td>
 319 </tr>
 320 </table>
 321
 322 <p>Modern web applications have at least a four-layer approach.  You
 323 have the client-side browser app, the web server, the server-side
 324 application, and the database.  Your data goes through a lot of layers
 325 before it actually resides on disk somewhere (or, as they're calling it
 326 these days, "The Cloud" *waves hands*).  Each of those layers requires
 327 some amount of computing resources, so to increase throughput, we must
 328 make the layers more efficient, or reduce the number of layers.
 329
 330 <table class="pizzapie">
 331 <tr><th>Blërg model</th></tr>
 332 <tr>
 333   <td style="background-color: blue; color: white"><b>Blërg Client App</b><br>HTML/Javascript</td>
 334 </tr>
 335 <tr>
 336   <td style="background-color: #404040; color: white"><b>Blërg Database</b><br>Fuckin' hardcore C and shit</td>
 337 </tr>
 338 </table>
 339
 340 <p>Blërg does both by smashing the last two or three layers into one
 341 application.  Blërg can be run as either a standalone web server, or as
 342 a CGI (FastCGI support is planned, but I just don't care right now).
 343 Less waste, more throughput.  As a consequence of this, the entirety of
 344 the application logic that the user sees is implemented in the client
 345 app in Javascript.  That's why all the URLs have #'s &mdash; the page is
 346 loaded once and switched on the fly to show different views, further
 347 reducing load on the server.  Even parsing hash tags and URLs are done
 348 in client JS.
 349
 350 <p>The API is simple and pragmatic.  It's not entirely RESTful, but is
 351 rather designed to work well with web-based front-ends.  Client data is
 352 always POSTed with the usual application/x-www-form-urlencoded encoding,
 353 and server data is always returned in JSON format.
 354
 355 <p>The HTTP interface to the database idea has already been done by <a
 356 href="http://couchdb.apache.org/">CouchDB</a>, though I didn't know that
 357 until after I wrote Blërg. :)
 358
 359 <h3><a name="database">Database</a></h3>
 360
 361 <p>I was impressed by <a
 362 href="http://www.varnish-cache.org/">varnish</a>'s design, so I decided
 363 early in the design process that I'd try out mmaped I/O.  Each user in
 364 Blërg has their own database, which consists of a metdata file, and one
 365 or more data and index files.  The data and index files are memory
 366 mapped, which hopefully makes things more efficient by letting the OS
 367 handle when to read from disk (or maybe not &mdash I haven't benchmarked
 368 it).  The index files are preallocated because I believe it's more
 369 efficient than writing to it 40 bytes at a time as records are added.
 370 The database's limits are reasonable:
 371
 372 <table class="statistics">
 373 <tr><td>maximum record size</td><td>65535 bytes</td></tr>
 374 <tr><td>maximum number of records per database</td><td>2<sup>64</sup> - 1 bytes</td></tr>
 375 <tr><td>maximum number of tags per record</td><td>1024</td></tr>
 376 <table>
 377
 378 <p>So as not to create grossly huge and unwieldy data files, the
 379 database layer splits data and index files into many "segments"
 380 containing at most 64K entries each.  Those of you doing some quick math
 381 in your heads may note that this could cause a problem on 32-bit
 382 machines &mdash; if a full segment contains entries of the maximum
 383 length, you'll have to mmap 4GB (32-bit Linux gives each process only
 384 3GB of virtual address space).  Right now, 32-bit users should change
 385 <code>RECORDS_PER_SEGMENT</code> in <code>config.h</code> to something
 386 lower like 32768.  In the future, I might do something smart like not
 387 mmaping the whole fracking file.
 388
 389 <table class="bitstructure">
 390 <tr><th>Record Index Structure</th></tr>
 391 <tr><td class="B4">offset (32-bit integer)</td></tr>
 392 <tr><td class="B2">length (16-bit integer)</td></tr>
 393 <tr><td class="B2">flags (16-bit integer)</td></tr>
 394 <tr><td class="B4">timestamp (32-bit integer)</td></tr>
 395 </table>
 396
 397 <p>A record is stored by first appending the data to the data file, then
 398 writing an entry in the index file containing the offset and length of
 399 the data, as well as the timestamp.  Since each index entry is fixed
 400 length, we can find the index entry simply by multiplying the record
 401 number we want by the size of the index entry.  Upshot: constant-time
 402 random-access reads and constant-time writes.  As an added bonus,
 403 because we're using append-only files, we get lockless reads.
 404
 405 <table class="bitstructure">
 406 <tr><th>Tag Structure</th></tr>
 407 <tr><td class="B32">username (32 bytes)</td></tr>
 408 <tr><td class="B8">record number (64-bit integer)</td></tr>
 409 </table>
 410
 411 <p>Tags are handled by a separate set of indices, one per tag.  When a
 412 record is added, it is scanned for tags, then entries are appended to
 413 each tag index for the tags found.  Each index record simply stores the
 414 user and record number.  Tags are searched by opening the tag file,
 415 reading the last 50 entries or so, and then reading all the records
 416 listed.  Voila, fast tag lookups.
 417
 418 <p>At this point, you're probably thinking, "Is that it?"  Yep, that's
 419 it.  Blërg isn't revolutionary, it's just a system whose requirements
 420 were pared down until the implementation could be made dead simple.
 421
 422 <p>Also, keeping with the style of modern object databases, I haven't
 423 implemented any data safety (har har).  Blërg does not sync anything to
 424 disk before returning success.  This should make Blërg extremely fast,
 425 and totally unreliable in a crash.  But that's the way you want it,
 426 right? :]
 427
 428 <h3><a name="problems">Problems, Caveats, and Future Work</a></h3>
 429
 430 <p>Blërg probably doesn't actually work like Twitter because I've never
 431 actually had a Twitter account.
 432
 433 <p>I couldn't find a really good fast HTTP server library.
 434 Libmicrohttpd is small, but it's focused on embedded applications, so it
 435 often eschews speed for small memory footprint.  This is especially
 436 apparent when you watch it chew through a POST request 300 bytes at a
 437 time even though you've specified a buffer size of 256K.
 438 <code>blerg.httpd</code> is still pretty fast this way &mdash; on my
 439 2GHz Opteron 246, <a
 440 href="http://www.joedog.org/index/siege-home">siege</a> says it serves a
 441 690-byte /get request at about 945 transactions per second, average
 442 response time 0.05 seconds, with 100 concurrent accesses &mdash; but a
 443 fast HTTP server implementation could knock this out of the park.
 444
 445 <p>Libmicrohttpd is also really difficult to work with.  If you look at
 446 the code, <code>http_blerg.c</code> is about 70% longer than
 447 <code>cgi_blerg.c</code> simply because of all the iterator hoops I had
 448 to jump through to process POST requests.  And if you can believe it, I
 449 wrote <code>http_blerg.c</code> first. If I'd done it the other way
 450 around, I probably would have given up on libmicrohttpd. :-/
 451
 452 <p>The data structures written to disk are dependent on the size and
 453 endianness of the primitive data types on your architecture and OS.
 454 This means that the databases are not portable.  A dump/import tool is
 455 probably the easiest way to handle this.
 456
 457 <p>I do want to make a FastCGI version eventually, and this will
 458 probably be a rather simple modification of cgi_blerg.
 459
 460 <p>Implementing deletes will be... interesting.  There is room in the
 461 record index for a 'deleted' flag, but the problem is deleting any tags
 462 referenced in the data.  This requires rescanning the record content and
 463 putting a 'deleted' flag in the tag indices.  This will not be pretty,
 464 so I'm just going to ignore it and hope nobody makes any mistakes. ;]
 465
 466 <p>Tag indices can grow arbitrarily large, which will cause problems for
 467 32-bit machines around the 3GB mark.  Still, that's something like 80
 468 million tags, so maybe it's not something to worry about.
 469
 470 <p>The API currently requires the client to transmit the user's password
 471 in the clear.  A digest-based authentication scheme would be better,
 472 though for real security, the app should run over HTTPS.
 473
 474 </body>
 475 </html>