www/doc/index.html

   1 <!DOCTYPE html>
   2 <html>
   3 <head>
   4 <title>Blërg Documentation</title>
   5 <link rel="stylesheet" href="/css/doc.css">
   6 </head>
   7 <body>
   8
   9 <h1>Blërg</h1>
  10
  11 Blërg is a minimalistic tagged text document database engine that also
  12 pretends to be a <a href="/">microblogging system</a>.  It is designed
  13 to efficiently store small (&lt; 64K) pieces of text in a way that they
  14 can be quickly retrieved by record number or by querying for tags
  15 embedded in the text.  Its native interface is HTTP &mdash; Blërg comes
  16 as either a standalone HTTP server, or a CGI.  Blërg is written in pure
  17 C.
  18
  19 <ul class="toc">
  20   <li><a href="#installing">Installing</a>
  21     <ul>
  22       <li><a href="#getting_the_source">Getting the source</a></li>
  23       <li><a href="#requirements">Requirements</a></li>
  24       <li><a href="#configuring">Configuring</a></li>
  25       <li><a href="#building">Building</a></li>
  26       <li><a href="#installing">Installing</a></li>
  27     </ul>
  28   </li>
  29   <li><a href="#design">Design</a>
  30     <ul>
  31       <li><a href="#motivation">Motivation</a></li>
  32       <li><a href="#web_app_stack">Web App Stack</a></li>
  33       <li><a href="#database">Database</a></li>
  34       <li><a href="#problems_and_future_work">Problems and Future Work</a></li>
  35     </ul>
  36   </li>
  37 </ul>
  38
  39 <h2><a name="installing">Installing</a></h2>
  40
  41 <h3><a name="getting_the_source">Getting the source</a></h3>
  42
   43 <p>There's no stable release yet, but you can get everything currently
  44 running on blerg.dominionofawesome.com by cloning the git repository at
  45 http://git.bytex64.net/blerg.git.
  46
  47 <h3><a name="requirements">Requirements</a></h3>
  48
  49 <p>Blërg has varying requirements depending on how you want to run it
  50 &mdash; as a standalone HTTP server, or as a CGI.  You will need:
  51
  52 <ul>
  53 <li><a href="http://lloyd.github.com/yajl/">yajl</a> &gt;= 1.0.0
  54 (yajl is a JSON parser/generator written in C which, by some twisted
  55 sense of humor, requires ruby to compile)</li>
  56 </ul>
  57
   58 <p>As a standalone HTTP server, you will also need:
  59
  60 <ul>
  61 <li><a href="http://www.gnu.org/software/libmicrohttpd/">GNU libmicrohttpd</a> &gt;= 0.9.3</li>
  62 </ul>
  63
  64 <p>Or, as a CGI, you will need:
  65
  66 <ul>
  67 <li><a href="http://www.newbreedsoftware.com/cgi-util/download/">cgi-util</a> &gt;= 2.2.1</li>
  68 </ul>
  69
  70 <h3><a name="configuring">Configuring</a></h3>
  71
  72 <p>I know I'm gonna get shit for not using an autoconf-based system, but
  73 I really didn't want to waste time figuring it out.  You should edit
  74 libs.mk and put in the paths where you can find headers and libraries
  75 for the above requirements.
  76
  77 <p>Also, further apologies to BSD folks &mdash; I've probably committed
  78 several unconscious Linux-isms.  It would not surprise me if the
  79 makefile refuses to work with BSD make.  If you have patches or
  80 suggestions on how to make Blërg more portable, I'd be happy to hear
  81 them.
  82
  83 <h3><a name="building">Building</a></h3>
  84
  85 <p>At this point, it should be gravy.  Type 'make' and in a few seconds,
  86 you should have <code>http_blerg</code>, <code>cgi_blerg</code>,
  87 <code>rss</code>, and <code>blergtool</code>.  Each of those can be made
  88 individually as well, if you, for example, don't want to install the
  89 prerequisites for <code>http_blerg</code> or <code>cgi_blerg</code>.
  90
  91 <h3><a name="installing">Installing</a></h3>
  92
  93 <p>While it's not required, Blërg will be easier to set up if you
  94 configure it to work from the root of your website.  For this reason,
   95 it's better to use a subdomain (e.g., blerg.yoursite.com is easier than
  96 yoursite.com/blerg/).  If you do want to put it in a subdirectory, you
  97 will have to modify www/js/blerg.js and change baseURL at the top.  The
  98 CGI version should work fine this way, but the HTTP version will require
  99 the request to be rewritten, as it expects to be serving from the root.
 100
 101 <h4>For the standalone web server:</h4>
 102
 103 <p>Right now, http_blerg doesn't serve any static assets, so you're
 104 going to have to put it behind a real webserver like apache, lighttpd,
 105 nginx, or similar.  Set the document root to the www directory, then
 106 proxy /info, /create, /login, /logout, /get, /tag, and /put to
 107 http_blerg.
 108
 109 <h4>For the CGI version:</h4>
 110
 111 <p>Copy the files in www to the root of your web server.  Copy cgi_blerg
 112 to blerg.cgi somewhere on your web server.  Included in www-configs is a
 113 .htaccess file for apache that will rewrite the URLs.  If you need to
 114 call cgi_blerg something other than blerg.cgi, the .htaccess file will
 115 need to be modified.
 116
 117 <h4>The extra RSS CGI</h4>
 118
  119 <p>There is an optional RSS CGI (called simply rss) that will serve RSS
 120 feeds for users.  Install this like the CGI version above (on my server,
 121 it's at /rss.cgi).
 122
 123
 124 <h2><a name="api">API</a></h2>
 125
 126 <p>Blërg's API was designed to be as simple as possible.  Data sent from
 127 the client is POSTed with the application/x-www-form-urlencoded
 128 encoding, and a successful response is always JSON.  The API endpoints
 129 will be described as though the server were serving requests from the
  130 root of the website.
 131
 132 <h3><a name="api_definitions">API Definitions</a></h3>
 133
 134 <p>On failure, all API calls return either a standard HTTP error
 135 response, like 404 Not Found if a record or user doesn't exist, or a 200
 136 response with some JSON indicating failure, which will look like this:
 137
 138 <p><code>{"status": "failure"}</code>
 139
 140 <p>Blërg doesn't currently explain <i>why</i> there is a failure, and
 141 I'm not sure it ever will.
 142
 143 <p>On success, you'll either get some JSON relating to your request (for
 144 /get, /tag, or /info), or a JSON object indicating success (for /create,
 145 /put, /login, or /logout), which looks like this:
 146
 147 <p><code>{"status": "success"}</code>
 148
 149 <p>For the CGI backend, you may get a 500 error if something goes wrong.
 150 For the HTTP backend, you'll get nothing (since it will have crashed),
 151 or maybe a 502 Bad Gateway if you have it behind another web server.
 152
 153 <p>All usernames must be 32 characters or less.  Usernames must contain
 154 only the ASCII characters 0-9, A-Z, a-z, underscore (_), period (.),
 155 hyphen (-), single quote ('), and space ( ).  Passwords can be at most
 156 64 bytes, and have no limits on characters (but beware: if you have a
 157 null in the middle, it will stop checking there because I use
 158 <code>strncmp(3)</code> to compare).
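<p>The null-byte caveat is easy to see in isolation.  This is only a
minimal sketch, not Blërg's actual comparison code: the function names,
the <code>MAX_PASSWORD_LENGTH</code> macro, and the zero-padded 64-byte
buffers are assumptions made for illustration.  It contrasts
<code>strncmp(3)</code>, which stops at the first null, with
<code>memcmp(3)</code>, which compares every byte.

```c
#include <assert.h>
#include <string.h>

#define MAX_PASSWORD_LENGTH 64  /* limit from the text; the macro name is hypothetical */

/* Compare two zero-padded password buffers the way strncmp(3) does:
 * the comparison ends at the first null byte. */
int passwords_match_strncmp(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    return strncmp(a, b, MAX_PASSWORD_LENGTH) == 0;
}

/* Compare all 64 bytes, including anything after an embedded null. */
int passwords_match_memcmp(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    return memcmp(a, b, MAX_PASSWORD_LENGTH) == 0;
}
```

<p>Two passwords that share a prefix up to an embedded null compare as
equal with the first function and as different with the second.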
 159
 160 <p>Tags must be 64 characters or less, and can contain only the ASCII
 161 characters 0-9, A-Z, a-z, hyphen (-), and underscore (_).
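<p>A client can check names against these rules before sending a
request.  The following is a sketch under stated assumptions rather than
the server's own validation: the function and macro names are
hypothetical, and rejecting empty names is my assumption, since the
rules above only give upper bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_USERNAME_LENGTH 32  /* limits from the text; macro names are hypothetical */
#define MAX_TAG_LENGTH 64

/* ASCII letters and digits, tested without locale-dependent <ctype.h>. */
static int is_ascii_alnum(char c) {
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Return 1 if s is 1..max_len characters long and every character is
 * alphanumeric or listed in extra.  Rejecting "" is an assumption. */
static int valid_name(const char *s, size_t max_len, const char *extra) {
    assert(s != NULL && extra != NULL);
    size_t len = strlen(s);
    if (len == 0 || len > max_len)
        return 0;
    for (size_t i = 0; i < len; i++) {
        if (!is_ascii_alnum(s[i]) && strchr(extra, s[i]) == NULL)
            return 0;
    }
    return 1;
}

/* Usernames also allow underscore, period, hyphen, single quote, and space. */
int valid_username(const char *s) { return valid_name(s, MAX_USERNAME_LENGTH, "_.-' "); }

/* Tags also allow hyphen and underscore only. */
int valid_tag(const char *s) { return valid_name(s, MAX_TAG_LENGTH, "_-"); }
```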
 162
  163 <h3><a name="api_create">/create</a> - create a new user</h3>
 164
 165 <p>To create a user, POST to /create with <code>username</code> and
 166 <code>password</code> parameters for the new user.  The server will
 167 respond with failure if the user exists, or if the user can't be created
 168 for some other reason.  The server will respond with success if the user
 169 is created.
 170
  171 <h3><a name="api_login">/login</a> - log in</h3>
 172
 173 <p>POST to /login with the <code>username</code> and
 174 <code>password</code> parameters for an existing user.  The server will
 175 respond with failure if the user does not exist or if the password is
 176 incorrect.  On success, the server will respond with success, and will
 177 set a cookie named 'auth' that must be sent by the client when accessing
 178 restricted API functions (/put and /logout).
 179
  180 <h3><a name="api_logout">/logout</a> - log out</h3>
 181
  182 <p>POST to /logout with <code>username</code>, the user to log out,
 183 along with the auth cookie in a Cookie header.  The server will respond
 184 with failure if the user does not exist or if the auth cookie is bad.
 185 The server will respond with success after the user is successfully
 186 logged out.
 187
  188 <h3><a name="api_put">/put</a> - add a new record</h3>
 189
 190 <p>POST to /put with <code>username</code> and <code>data</code>
 191 parameters, and an auth cookie.  The server will respond with failure
 192 if the auth cookie is bad, if the user doesn't exist, or if
 193 <code>data</code> contains more than 65535 bytes <i>after</i> URL
 194 decoding.  The server will respond with success after the record is
 195 successfully added.
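<p>Because the limit applies <i>after</i> URL decoding, a client that
wants to check the size up front has to count decoded bytes, not encoded
ones.  Here is a minimal sketch of that count; the function names and
the <code>MAX_RECORD_SIZE</code> macro are hypothetical, and the
handling of a malformed <code>%</code> escape (counted as a literal
byte) is an assumption, not necessarily what the server does.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_RECORD_SIZE 65535  /* limit from the text; the macro name is hypothetical */

/* Length of s after application/x-www-form-urlencoded decoding:
 * a well-formed "%XX" counts as one byte, and so does everything else
 * (including '+', which decodes to a single space). */
size_t url_decoded_length(const char *s) {
    size_t n = 0;
    while (*s) {
        if (s[0] == '%' && isxdigit((unsigned char)s[1]) && isxdigit((unsigned char)s[2]))
            s += 3;  /* skip the whole escape */
        else
            s += 1;  /* ordinary byte, or a malformed '%' kept as-is */
        n++;
    }
    return n;
}

/* Return 1 if the decoded data would fit in a single record. */
int data_fits_record(const char *encoded) {
    return url_decoded_length(encoded) <= MAX_RECORD_SIZE;
}
```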
 196
  197 <h3><a name="api_get">/get/(user), /get/(user)/(start record)-(end record)</a> - get records for a user</h3>
 198
 199 <p>A GET request to /get/(user), where (user) is the user desired, will
 200 return the last 50 records for that user in a list of objects.  The
 201 record objects look like this:
 202
 203 <pre>
 204 {
 205   "record":"0",
 206   "timestamp":1294309438,
 207   "data":"eatin a taco on fifth street"
 208 }
 209 </pre>
 210
 211 <p><code>record</code> is the record number, <code>timestamp</code> is
 212 the UNIX epoch timestamp (i.e., the number of seconds since Jan 1 1970
 213 00:00:00 GMT), and <code>data</code> is the content of the record.  The
 214 record number is sent as a string because while Blërg supports record
 215 numbers up to 2<sup>64</sup> - 1, Javascript uses floating point for all
 216 its numbers, and can only support integers without truncation up to
 217 2<sup>53</sup>.  This difference is largely academic, but I didn't want
 218 this problem to sneak up on anyone who is more insane than I am. :]
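<p>The 2<sup>53</sup> limit is not specific to Javascript; it comes from
the 64-bit floating point format, so it can be demonstrated in C as
well.  This is a small sketch (the function name is made up for
illustration) that checks whether a 64-bit integer survives a round trip
through a <code>double</code>.

```c
#include <stdint.h>

/* Return 1 if n converts to a double and back without changing. */
int survives_double(uint64_t n) {
    double d = (double)n;
    /* Values near UINT64_MAX round up to 2^64, which does not fit in a
     * uint64_t, so guard before converting back. */
    if (d >= 18446744073709551616.0)
        return 0;
    return (uint64_t)d == n;
}
```

<p>On an IEEE 754 platform, 2<sup>53</sup> survives the round trip and
2<sup>53</sup> + 1 does not, which is why a numeric record number could
silently change in the client.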
 219
 220 <p>The second form, /get/(user)/(start record)-(end record), retrieves a
 221 specific range of records, from (start record) to (end record)
 222 inclusive.  You can retrieve at most 100 records this way.  If (end
 223 record) - (start record) specifies more than 100 records, the server
 224 will respond with JSON failure.
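<p>Since the range is inclusive, it holds (end record) - (start record)
+ 1 records.  A client-side check might look like the sketch below.  The
names are hypothetical, rejecting a reversed range is my assumption, and
the server's own boundary handling may differ from this reading of the
rule.

```c
#include <stdint.h>

#define MAX_RECORDS_PER_GET 100  /* limit from the text; the macro name is hypothetical */

/* Return 1 if the inclusive range [start, end] is well-formed and
 * covers at most 100 records, 0 otherwise. */
int valid_record_range(uint64_t start, uint64_t end) {
    if (end < start)
        return 0;  /* assumption: reversed ranges are rejected */
    /* An inclusive range holds (end - start + 1) records.  Comparing
     * against limit - 1 avoids overflow when end is UINT64_MAX. */
    return end - start <= MAX_RECORDS_PER_GET - 1;
}
```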
 225
  226 <h3><a name="api_info">/info/(user)</a> - Get information about a user</h3>
 227
 228 <p>A GET request to /info/(user) will return a JSON object with
 229 information about the user (currently only the number of records).  The
 230 info object looks like this:
 231
 232 <pre>
 233 {
 234   "record_count": "544"
 235 }
 236 </pre>
 237
 238 <p>Again, the record count is sent as a string for 64-bit safety.
 239
  240 <h3><a name="api_tag">/tag/(#|H|@)(tagname)</a> - Retrieve records containing tags</h3>
 241
 242 <p>A GET request to this endpoint will return the last 50 records
 243 associated with the given tag.  The first character is either # or H for
 244 hashtags, or @ for mentions (I call them ref tags).  You should URL
 245 encode the # or @, lest some servers complain at you.  The H alias for #
 246 was created because Apache helpfully strips the fragment of a URL
 247 (everything from the # to the end) before handing it off to the CGI,
 248 even if the hash is URL encoded.  The record objects also contain an
 249 extra <code>author</code> field, like so:
  250 <pre>
  251 {
  252   "author":"Jon",
  253   "record":"57",
  254   "timestamp":1294555793,
  255   "data":"I'm taking #garfield to the vet."
  256 }
  257 </pre>
  258 <p>There is currently no support for getting more than the last 50 records
  259 for a tag, but /tag will probably mutate to work like /get.
 260
 261 <h2><a name="design">Design</a></h2>
 262
 263 <h3><a name="motivation">Motivation</a></h3>
 264
 265 <p>Blërg was created as the result of a thought experiment: "What if
 266 Twitter didn't need thousands of servers? What if its millions of users
 267 could be handled by a single highly efficient server?"  This is probably
 268 an unreachable goal due to the sheer amount of I/O, but we could
 269 certainly do better.  Blërg was thus designed as a system with very
 270 simple requirements:
 271
 272 <ol>
 273 <li>Store and fetch small chunks of text efficiently</li>
 274 <li>Create fast indexes for hash tags and @ mentions</li>
  275 <li>Provide an HTTP interface web apps can use</li>
 276 </ol>
 277
 278 <p>And to further simplify, I didn't bother handling deletes, full text
 279 search, or more complicated tag searches.  Blërg only does the basics.
 280
 281 <h3><a name="web_app_stack">Web App Stack</a></h3>
 282
 283 <table class="pizzapie">
 284 <tr><th>Classical model</th></tr>
 285 <tr>
 286   <td style="background-color: blue; color: white"><b>Client App</b><br>HTML/Javascript</td>
 287 </tr>
 288 <tr>
 289   <td style="background-color: #9F0000; color: white"><b>Webserver</b><br>Apache, lighttpd, nginx, etc.</td>
 290 </tr>
 291 <tr>
 292   <td style="background-color: #009F00; color: white"><b>Server App</b><br>Python, Perl, Ruby, etc.</td>
 293 </tr>
 294 <tr>
 295   <td style="background-color: #404040; color: white"><b>Database</b><br>MySQL, PostgreSQL, MongoDB, CouchDB, etc.</td>
 296 </tr>
 297 </table>
 298
 299 <p>Modern web applications have at least a four-layer approach.  You
 300 have the client-side browser app written in HTML and Javascript, the web
 301 server, the server-side application typically written in some scripting
 302 language (or, if it's high-performance, ASP/Java/C/C++), and the
 303 database (usually SQL, but newer web apps seem to love object-oriented
 304 DBs).
 305
 306 <table class="pizzapie">
 307 <tr><th>Blërg model</th></tr>
 308 <tr>
 309   <td style="background-color: blue; color: white"><b>Blërg Client App</b><br>HTML/Javascript</td>
 310 </tr>
 311 <tr>
 312   <td style="background-color: #404040; color: white"><b>Blërg Database</b><br></td>
 313 </tr>
 314 </table>
 315
 316 <p>Blërg compresses the last two or three layers into one application.
 317 Blërg can be run as either a standalone web server, or as a CGI (FastCGI
 318 support is planned, but I just don't care right now).  Less waste, more
 319 throughput.  As a consequence of this, the entirety of the application
 320 logic that the user sees is implemented in the client app in Javascript.
 321 That's why all the URLs have #'s &mdash; the page is loaded once and
 322 switched on the fly to show different views, further reducing load on
  323 the server.  Even parsing hash tags and URLs is done in client JS.
 324
 325 <p>The API is simple and pragmatic.  It's not entirely RESTful, but is
 326 rather designed to work well with web-based front-ends.  Client data is
 327 always POSTed with the usual application/x-www-form-urlencoded encoding,
 328 and server data is always returned in JSON format.
 329
  330 <p>The idea of an HTTP interface to the database has already been done by <a
 331 href="http://couchdb.apache.org/">CouchDB</a>, though I didn't know that
 332 until after I wrote Blërg. :)
 333
 334 <h3><a name="database">Database</a></h3>
 335
 336 <p>Early in the design process, I decided to blatantly copy <a
 337 href="http://www.varnish-cache.org/">varnish</a> and rely heavily on
 338 mmap for I/O.  Each user in Blërg has their own database, which consists
 339 of one or more data and index files, and a metadata file.  When a
 340 database is opened, only the metadata is actually read (currently a
 341 single 64-bit integer keeping track of the last record id).  The data
 342 and index files are memory mapped, which hopefully makes things more
 343 efficient by letting the OS handle when to read from disk.  The index
 344 files are preallocated because I believe it's more efficient than
  345 writing to them 40 bytes at a time as records are added.  Here's some info
 346 on the database's limitations:
 347
 348 <table class="statistics">
 349 <tr><td>maximum record size</td><td>65535 bytes</td></tr>
  350 <tr><td>maximum number of records per database</td><td>2<sup>64</sup> - 1</td></tr>
  351 <tr><td>maximum number of tags per record</td><td>1024</td></tr>
  352 </table>
 353
 354 <p>To provide support for
 355 32-bit machines, and to not create grossly huge and unwieldy data files,
 356 the database layer splits data and index files into many "segments"
 357 containing at most 64K entries each.  Those of you doing some quick math
 358 in your heads may note that this could cause a problem on 32-bit
 359 machines &mdash; if a full segment contains entries of the maximum
 360 length, you'll have to mmap 4GB (32-bit Linux gives each process only
 361 3GB of virtual memory addressing).  Right now, 32-bit users should
 362 change <code>RECORDS_PER_SEGMENT</code> in <code>config.h</code> to
 363 something lower like 32768.  In the future, I might do something smart
 364 like not mmaping the whole fracking file.
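<p>The quick math can be written out.  This sketch only multiplies the
figures given above (65535-byte records, 64K or 32768 entries per
segment, 3GB of address space); the function names and the
<code>MAX_RECORD_SIZE</code> macro are hypothetical and are not taken
from <code>config.h</code>.

```c
#include <stdint.h>

#define MAX_RECORD_SIZE 65535ULL  /* maximum record size from the table above */

/* Worst-case size in bytes of one data segment: every record at maximum length. */
uint64_t max_segment_bytes(uint64_t records_per_segment) {
    return records_per_segment * MAX_RECORD_SIZE;
}

/* Return 1 if a full worst-case segment fits in the given address space. */
int segment_fits(uint64_t records_per_segment, uint64_t address_space_bytes) {
    return max_segment_bytes(records_per_segment) <= address_space_bytes;
}
```

<p>With 65536 entries the worst case is 4,294,901,760 bytes, just under
4GB and too big for a 3GB address space.  With 32768 entries it is
2,147,450,880 bytes, which fits, though the process still needs room for
everything else it maps.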
 365
 366 <p>A record is stored by first appending the data to the data file, then
 367 writing an index entry containing the offset and length of the data, as
 368 well as the timestamp, to the index file.  Since each index entry is
 369 fixed length, we can find the index entry simply by multiplying the
 370 record number we want by the size of the index entry.  Upshot:
 371 constant-time random-access reads and constant-time writes.  As an added
 372 bonus, because we're using append-only files, we get lockless reads.
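<p>The constant-time lookup can be sketched as follows.  The struct
layout here is purely hypothetical (the real on-disk index entry may
have different fields, widths, and padding), and
<code>RECORDS_PER_SEGMENT</code> is set to the 64K default implied
above.  The point is only that finding an entry takes a division, a
remainder, and a multiplication.

```c
#include <stdint.h>

#define RECORDS_PER_SEGMENT 65536  /* "64K entries" per segment, per the text */

/* Hypothetical fixed-length index entry; the real layout may differ. */
struct index_entry {
    uint32_t offset;     /* where the record starts in the data file */
    uint16_t length;     /* record length, at most 65535 bytes */
    uint32_t timestamp;  /* UNIX epoch seconds */
};

struct index_location {
    uint64_t segment;      /* which index/data segment holds the record */
    uint64_t byte_offset;  /* offset of the entry inside that index segment */
};

/* Constant-time lookup: no searching, just arithmetic on the record number. */
struct index_location locate_index_entry(uint64_t record) {
    struct index_location loc;
    loc.segment = record / RECORDS_PER_SEGMENT;
    loc.byte_offset = (record % RECORDS_PER_SEGMENT) * sizeof(struct index_entry);
    return loc;
}
```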
 373
 374 <p>Tags are handled by a separate set of indices, one per tag.  Each
 375 index record simply stores the user and record number.  Tags are
 376 searched by opening the tag file, reading the last 50 entries or so, and
 377 then reading all the records listed.  Voila, fast tag lookups.
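<p>Finding the last 50 entries is the same kind of arithmetic.  In this
sketch the tag entry layout (a fixed 32-byte username plus a 64-bit
record number) is an assumption based on the description above, and the
names are hypothetical.  Given the size of the tag file, it computes
which entry to start reading from.

```c
#include <stdint.h>

#define MAX_USERNAME_LENGTH 32  /* username limit from the API section */
#define TAG_RESULTS 50          /* "the last 50 entries or so" */

/* Hypothetical tag index entry: who wrote the record, and which record. */
struct tag_entry {
    char username[MAX_USERNAME_LENGTH];
    uint64_t record;
};

/* Given the size of a tag index file in bytes, return the index of the
 * first entry to read so that at most the last 50 entries are returned. */
uint64_t first_tag_entry(uint64_t file_size) {
    uint64_t count = file_size / sizeof(struct tag_entry);
    return count > TAG_RESULTS ? count - TAG_RESULTS : 0;
}
```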
 378
 379 <p>At this point, you're probably thinking, "Is that it?"  Yep, that's
 380 it.  Blërg isn't revolutionary, it's just a system whose requirements
 381 were pared down until the implementation could be made dead simple.
 382
 383 <p>Also, keeping with the style of modern object databases, I haven't
 384 implemented any data safety (har har).  Blërg does not sync anything to
 385 disk before returning success.  This should make Blërg extremely fast,
 386 and totally unreliable in a crash.  But that's the way you want it,
 387 right? :]
 388
 389 <h3><a name="problems_and_future_work">Problems and Future Work</a></h3>
 390
 391 <p>Blërg probably doesn't actually work like Twitter because I've never
 392 actually had a Twitter account.
 393
 394 <p>I couldn't find a really good fast HTTP server library.
 395 Libmicrohttpd is small, but it's focused on embedded applications, so it
 396 often eschews speed for small memory footprint.  This is especially
 397 apparent when you watch it chew through a POST request 300 bytes at a
 398 time even though you've specified a buffer size of 256K.  Http_blerg is
 399 still pretty fast this way (on my 2GHz Opteron 246, <a
 400 href="http://www.joedog.org/index/siege-home">siege</a> says it serves a
 401 690-byte /get request at about 945 transactions per second, average
 402 response time 0.05 seconds, with 100 concurrent accesses), but a
 403 high-efficiency HTTP server implementation could knock this out of the
 404 park.
 405
 406 <p>Libmicrohttpd is also really difficult to work with.  If you look at
 407 the code, http_blerg.c is about 70% longer than cgi_blerg.c simply
 408 because of all the iterator hoops I had to jump through to process POST
 409 requests.  And if you can believe it, I wrote http_blerg.c first. If
 410 I'd done it the other way around, I probably would have given up on
 411 libmicrohttpd. :-/
 412
 413 <p>The data structures written to disk are dependent on the size and
 414 endianness of the primitive data types on your architecture and OS.
 415 This means that the databases are not portable.  A dump/import tool is
 416 probably the easiest way to handle this.
 417
 418 <p>I do want to make a FastCGI version eventually, and this will
 419 probably be a rather simple modification of cgi_blerg.
 420
 421 <p>Implementing deletes will be... interesting.  There is room in the
 422 record index for a 'deleted' flag, but the problem is deleting any tags
 423 referenced in the data.  This requires rescanning the record content and
 424 putting a 'deleted' flag in the tag indices.  This will not be pretty,
 425 so I'm just going to ignore it and hope nobody makes any mistakes. ;]
 426
 427 <p>Tag indices can grow arbitrarily large, which will cause problems for
 428 32-bit machines around the 3GB mark.  Still, that's something like 80
 429 million tags, so maybe it's not something to worry about.
 430
 431 <p>The API currently requires the client to transmit the user's password
 432 in the clear.  A digest-based authentication scheme would be better,
 433 though for real security, the app should run over HTTPS.
 434
 435 </body>
 436 </html>