www/doc/index.html

   1 <!DOCTYPE html>
   2 <html>
   3 <head>
   4 <title>Blërg Documentation</title>
   5 <link rel="stylesheet" href="/css/doc.css">
   6 </head>
   7 <body>
   8
   9 <h1>Blërg</h1>
  10
  11 Blërg is a minimalistic tagged text document database engine that also
  12 pretends to be a <a href="/">microblogging system</a>.  It is designed
  13 to efficiently store small (&lt; 64K) pieces of text in a way that they
  14 can be quickly retrieved by record number or by querying for tags
  15 embedded in the text.  Its native interface is HTTP &mdash; Blërg comes
  16 as either a standalone HTTP server, or a CGI.  Blërg is written in pure
  17 C.
  18
  19 <ul class="toc">
  20   <li><a href="#installing">Installing</a>
  21     <ul>
  22       <li><a href="#getting_the_source">Getting the source</a></li>
  23       <li><a href="#requirements">Requirements</a></li>
  24       <li><a href="#configuring">Configuring</a></li>
  25       <li><a href="#building">Building</a></li>
  26       <li><a href="#installing">Installing</a></li>
  27     </ul>
  28   </li>
  29   <li><a href="#api">API</a>
  30     <ul>
  31       <li><a href="#api_definitions">API Definitions</a></li>
  32       <li><a href="#api_create">/create - create a new user</a></li>
  33       <li><a href="#api_login">/login - log in</a></li>
  34       <li><a href="#api_logout">/logout - log out</a></li>
  35       <li><a href="#api_put">/put - add a new record</a></li>
  36       <li><a href="#api_get">/get/(user), /get/(user)/(start record)-(end record) - get records for a user</a></li>
  37       <li><a href="#api_info">/info/(user) - Get information about a user</a></li>
  38       <li><a href="#api_tag">/tag/(#|H|@)(tagname) - Retrieve records containing tags</a></li>
  39       <li><a href="#api_subscribe">/subscribe/(user) - Subscribe to a user's updates</a></li>
  40       <li><a href="#api_unsubscribe">/unsubscribe/(user) - Unsubscribe from a user's updates</a></li>
  41       <li><a href="#api_feed">/feed - Get updates for subscribed users</a></li>
  42       <li><a href="#api_feedinfo">/feedinfo, /feedinfo/(user) - Get subscription status</a></li>
  43       <li><a href="#api_passwd">/passwd - Change a user's password</a></li>
  44     </ul>
  45   </li>
  46   <li><a href="#design">Design</a>
  47     <ul>
  48       <li><a href="#motivation">Motivation</a></li>
  49       <li><a href="#web_app_stack">Web App Stack</a></li>
  50       <li><a href="#database">Database</a></li>
  51       <li><a href="#subscriptions">Subscriptions</a></li>
      <li><a href="#problems">Problems, Caveats, and Future Work</a></li>
  53     </ul>
  54   </li>
  55 </ul>
  56
  57 <h2><a name="installing">Installing</a></h2>
  58
  59 <h3><a name="getting_the_source">Getting the source</a></h3>
  60
  61 <p>There's no stable release yet, but you can get everything currently
  62 running on blerg.dominionofawesome.com by cloning the git repository at
  63 http://git.bytex64.net/blerg.git.
  64
  65 <h3><a name="requirements">Requirements</a></h3>
  66
  67 <p>Blërg has varying requirements depending on how you want to run it
  68 &mdash; as a standalone HTTP server, or as a CGI.  You will need:
  69
  70 <ul>
  71 <li><a href="http://lloyd.github.com/yajl/">yajl</a> &gt;= 1.0.0
  72 (yajl is a JSON parser/generator written in C which, by some twisted
  73 sense of humor, requires ruby to compile)</li>
  74 </ul>
  75
<p>As a standalone HTTP server, you will also need:
  77
  78 <ul>
  79 <li><a href="http://www.gnu.org/software/libmicrohttpd/">GNU libmicrohttpd</a> &gt;= 0.9.3</li>
  80 </ul>
  81
  82 <p>Or, as a CGI, you will need:
  83
  84 <ul>
  85 <li><a href="http://www.newbreedsoftware.com/cgi-util/download/">cgi-util</a> &gt;= 2.2.1</li>
  86 </ul>
  87
  88 <h3><a name="configuring">Configuring</a></h3>
  89
  90 <p>There is now an experimental autoconf build system.  If you run
  91 <code>add-autoconf</code>, it'll do the magic and create a
  92 <code>configure</code> script that'll do the familiar things.  If I ever
  93 get around to distributing source packages, you should find that this
  94 has already been done.
  95
  96 <p>If you'd rather stick with the manual system, you should edit libs.mk
  97 and put in the paths where you can find headers and libraries for the
  98 above requirements.
  99
 100 <p>Also, further apologies to BSD folks &mdash; I've probably committed
 101 several unconscious Linux-isms.  It would not surprise me if the
 102 makefile refuses to work with BSD make, or if it fails to compile even
 103 with gmake.  If you have patches or suggestions on how to make Blërg
 104 more portable, I'd be happy to hear them.
 105
 106 <h3><a name="building">Building</a></h3>
 107
 108 <p>At this point, it should be gravy.  Type 'make' and in a few seconds,
 109 you should have <code>blerg.httpd</code>, <code>blerg.cgi</code>,
 110 <code>rss.cgi</code>, and <code>blergtool</code>.  Each of those can be
 111 made individually as well, if you, for example, don't want to install
 112 the prerequisites for <code>blerg.httpd</code> or
 113 <code>blerg.cgi</code>.
 114
 115 <p><strong>NOTE</strong>: blerg.httpd is deprecated and will not be
 116 updated with new features.
 117
 118 <h3><a name="installing">Installing</a></h3>
 119
 120 <p>While it's not strictly required, Blërg will be easier to set up if
 121 you configure it to work from the root of your website.  For this
reason, it's better to use a subdomain (e.g., blerg.yoursite.com is
 123 easier than yoursite.com/blerg/).  If you do want to put it in a
 124 subdirectory, you will have to modify <code>www/js/blerg.js</code> and
 125 change baseURL at the top as well as a number of other self-references
 126 in that file and <code>www/index.html</code>.  The CGI version should
 127 work fine this way, but the HTTP version will require the request to be
 128 rewritten, as it expects to be serving from the root.
 129
 130 <p>You cannot serve the database and client from different domains
(e.g., yoursite.com vs othersite.net, or even foo.yoursite.com and
 132 bar.yoursite.com).  This is a requirement of the web browser &mdash; the
 133 same origin policy will not allow an AJAX request to travel across
 134 domains.
 135
 136 <h4>For the standalone web server:</h4>
 137
 138 <p>Right now, <code>blerg.httpd</code> doesn't serve any static assets,
 139 so you're going to have to put it behind a real webserver like apache,
 140 lighttpd, nginx, or similar.  Set the document root to the www
 141 directory, then proxy /info, /create, /login, /logout, /get, /tag, and
 142 /put to blerg.httpd.  You can change the port <code>blerg.httpd</code>
 143 listens on in <code>config.h</code>.
 144
 145 <h4>For the CGI version:</h4>
 146
 147 <p>Copy the files in www/ to the root of your web server.  Copy
 148 <code>blerg.cgi</code> to your web server.  Included in www-configs/ is
 149 a .htaccess file for Apache that will rewrite the URLs.  If you need to
 150 call the CGI something other than <code>blerg.cgi</code>, the .htaccess
 151 file will need to be modified.
 152
 153 <h4>The extra RSS CGI</h4>
 154
 155 <p>There is an optional RSS cgi (<code>rss.cgi</code>) that will serve
 156 RSS feeds for users.  Install this like <code>blerg.cgi</code> above.
 157
 158
 159 <h2><a name="api">API</a></h2>
 160
 161 <p>Blërg's API was designed to be as simple as possible.  Data sent from
 162 the client is POSTed with the application/x-www-form-urlencoded
 163 encoding, and a successful response is always JSON.  The API endpoints
 164 will be described as though the server were serving requests from the
root of the website.
 166
 167 <h3><a name="api_definitions">API Definitions</a></h3>
 168
 169 <p>On failure, all API calls return either a standard HTTP error
 170 response, like 404 Not Found if a record or user doesn't exist, or a 200
 171 response with a 'JSON failure', which will look like this:
 172
 173 <p><code>{"status": "failure"}</code>
 174
 175 <p>Blërg doesn't currently explain <i>why</i> there is a failure, and
 176 I'm not sure it ever will.
 177
 178 <p>On success, you'll either get some JSON relating to your request (for
 179 /get, /tag, or /info), or a 'JSON success' response (for /create, /put,
 180 /login, or /logout), which looks like this:
 181
 182 <p><code>{"status": "success"}</code>
 183
 184 <p>For the CGI backend, you may get a 500 error if something goes wrong.
 185 For the HTTP backend, you'll get nothing (since it will have crashed),
 186 or maybe a 502 Bad Gateway if you have it behind another web server.
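
<p>As a rough illustration of the rules above, a client might sort
replies into four buckets.  This is a sketch in Python, not part of
Blërg; the function name <code>classify_response</code> and the bucket
labels are made up for the example.

```python
import json

def classify_response(http_status, body):
    """Sort a reply into 'http-error', 'failure', 'success' or 'data'."""
    if http_status != 200:
        # Standard HTTP errors such as 404, 500 or 502 carry no JSON to inspect.
        return "http-error"
    payload = json.loads(body)
    if isinstance(payload, dict) and payload.get("status") == "failure":
        return "failure"
    if isinstance(payload, dict) and payload.get("status") == "success":
        return "success"
    # Anything else is request-specific JSON (/get, /tag, /info, ...).
    return "data"
```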
 187
 188 <p>All usernames must be 32 characters or less.  Usernames must contain
 189 only the ASCII characters 0-9, A-Z, a-z, underscore (_), and hyphen (-).
 190 Passwords can be at most 64 bytes, and have no limits on characters (but
 191 beware: if you have a null in the middle, it will stop checking there
 192 because I use <code>strncmp(3)</code> to compare).
 193
 194 <p>Tags must be 64 characters or less, and can contain only the ASCII
 195 characters 0-9, A-Z, a-z, underscore (_), and hyphen (-).
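
<p>A client can check these limits before sending anything.  The sketch
below is Python and only mirrors the rules stated above; the helper
names are hypothetical, and rejecting empty names is an assumption (the
rules above do not say either way).

```python
import re

# 0-9, A-Z, a-z, underscore and hyphen; one or more characters.
_NAME_CHARS = re.compile(r"[0-9A-Za-z_-]+")

def valid_username(name):
    """Usernames: at most 32 characters from the allowed ASCII set."""
    return len(name) <= 32 and _NAME_CHARS.fullmatch(name) is not None

def valid_tag(tag):
    """Tags: at most 64 characters from the same set."""
    return len(tag) <= 64 and _NAME_CHARS.fullmatch(tag) is not None
```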
 196
<h3><a name="api_create">/create</a> - create a new user</h3>
 198
 199 <p>To create a user, POST to /create with <code>username</code> and
 200 <code>password</code> parameters for the new user.  The server will
 201 respond with JSON failure if the user exists, or if the user can't be
 202 created for some other reason.  The server will respond with JSON
 203 success if the user is created.
 204
<h3><a name="api_login">/login</a> - log in</h3>
 206
 207 <p>POST to /login with the <code>username</code> and
 208 <code>password</code> parameters for an existing user.  The server will
 209 respond with JSON failure if the user does not exist or if the password
 210 is incorrect.  On success, the server will respond with JSON success,
 211 and will set a cookie named 'auth' that must be sent by the client when
 212 accessing restricted API functions (/put and /logout).
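
<p>For example, a login request could be assembled like this in Python.
It is only a sketch: the host <code>blerg.example.com</code> and the
helper name are placeholders, and the code builds the request without
sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_login_request(base_url, username, password):
    """Build (but do not send) a form-encoded POST to /login."""
    body = urlencode({"username": username, "password": password})
    return Request(
        base_url + "/login",
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# blerg.example.com is a placeholder host.
req = build_login_request("http://blerg.example.com", "Jon", "s3cret pw")
```

<p>Passing <code>req</code> to <code>urllib.request.urlopen</code> would
actually send it; the auth cookie then arrives in the response's
Set-Cookie header.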
 213
<h3><a name="api_logout">/logout</a> - log out</h3>
 215
<p>POST to /logout with <code>username</code>, the user to log out,
 217 along with the auth cookie in a Cookie header.  The server will respond
 218 with JSON failure if the user does not exist or if the auth cookie is
 219 bad.  The server will respond with JSON success after the user is
 220 successfully logged out.
 221
<h3><a name="api_put">/put</a> - add a new record</h3>
 223
 224 <p>POST to /put with <code>username</code> and <code>data</code>
 225 parameters, and an auth cookie.  The server will respond with JSON
 226 failure if the auth cookie is bad, if the user doesn't exist, or if
 227 <code>data</code> contains more than 65535 bytes <i>after</i> URL
 228 decoding.  The server will respond with JSON success after the record is
 229 successfully added.
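
<p>Because the limit applies after URL decoding, the size to check is
that of the raw text, not of the encoded form body.  A minimal Python
sketch, assuming the text is sent as UTF-8 (nothing above specifies an
encoding) and using a hypothetical helper name:

```python
from urllib.parse import urlencode

MAX_RECORD_BYTES = 65535

def encode_put_body(username, data):
    """Return the form body for /put, refusing oversized records."""
    raw = data.encode("utf-8")  # assumption: records are sent as UTF-8
    if len(raw) > MAX_RECORD_BYTES:
        raise ValueError("record is %d bytes; limit is %d"
                         % (len(raw), MAX_RECORD_BYTES))
    # The encoded body may be longer than the limit; only the decoded
    # size counts.
    return urlencode({"username": username, "data": data}).encode("ascii")
```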
 230
<h3><a name="api_get">/get/(user), /get/(user)/(start record)-(end record)</a> - get records for a user</h3>
 232
 233 <p>A GET request to /get/(user), where (user) is the user desired, will
 234 return the last 50 records for that user in a list of objects.  The
 235 record objects look like this:
 236
 237 <pre>
 238 {
 239   "record":"0",
 240   "timestamp":1294309438,
 241   "data":"eatin a taco on fifth street"
 242 }
 243 </pre>
 244
 245 <p><code>record</code> is the record number, <code>timestamp</code> is
 246 the UNIX epoch timestamp (i.e., the number of seconds since Jan 1 1970
 247 00:00:00 GMT), and <code>data</code> is the content of the record.  The
 248 record number is sent as a string because while Blërg supports record
 249 numbers up to 2<sup>64</sup> - 1, Javascript uses floating point for all
 250 its numbers, and can only support integers without truncation up to
 251 2<sup>53</sup>.  This difference is largely academic, but I didn't want
 252 this problem to sneak up on anyone who is more insane than I am. :]
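
<p>The truncation is easy to demonstrate.  The Python sketch below pushes
integers through a double, the number type Javascript uses, and shows why
parsing the string form is safer.  The helper names are made up for the
example.

```python
def survives_double(n):
    """True if integer n round-trips through an IEEE 754 double unchanged."""
    return int(float(n)) == n

def parse_record_number(record):
    """Read the record number from its string form, keeping all 64 bits."""
    return int(record["record"])

# The largest record number Blërg supports, as it would appear in JSON.
biggest = parse_record_number(
    {"record": "18446744073709551615", "timestamp": 1294309438, "data": ""}
)
```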
 253
 254 <p>The second form, /get/(user)/(start record)-(end record), retrieves a
 255 specific range of records, from (start record) to (end record)
 256 inclusive.  You can retrieve at most 100 records this way.  If (end
 257 record) - (start record) specifies more than 100 records, or if the
 258 range specifies invalid records, or if the end record is before the
 259 start record, the server will respond with JSON failure.
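
<p>A client can apply the same checks before asking.  This Python sketch
reads "at most 100 records" as an inclusive count of (end record) -
(start record) + 1; the server's exact boundary check may differ, and the
helper name is hypothetical.

```python
MAX_RANGE = 100

def get_range_path(user, start, end):
    """Build a /get/(user)/(start)-(end) path, rejecting bad ranges."""
    if start < 0 or end < start:
        raise ValueError("end record must not come before start record")
    if end - start + 1 > MAX_RANGE:  # the range is inclusive
        raise ValueError("at most %d records per request" % MAX_RANGE)
    return "/get/%s/%d-%d" % (user, start, end)
```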
 260
<h3><a name="api_info">/info/(user)</a> - Get information about a user</h3>
 262
 263 <p>A GET request to /info/(user) will return a JSON object with
 264 information about the user (currently only the number of records).  The
 265 info object looks like this:
 266
 267 <pre>
 268 {
 269   "record_count": "544"
 270 }
 271 </pre>
 272
 273 <p>Again, the record count is sent as a string for 64-bit safety.
 274
<h3><a name="api_tag">/tag/(#|H|@)(tagname)</a> - Retrieve records containing tags</h3>
 276
 277 <p>A GET request to this endpoint will return the last 50 records
 278 associated with the given tag.  The first character is either # or H for
 279 hashtags, or @ for mentions (I call them ref tags).  You should URL
 280 encode the # or @, lest some servers complain at you.  The H alias for #
 281 was created because Apache helpfully strips the fragment of a URL
 282 (everything from the # to the end) before handing it off to the CGI,
 283 even if the hash is URL encoded.  The record objects also contain an
 284 extra <code>author</code> field, like so:
 285
 286 <pre>
 287 {
 288   "author":"Jon",
 289   "record":"57",
 290   "timestamp":1294555793,
 291   "data":"I'm taking #garfield to the vet."
 292 }
 293 </pre>
 294
<p>There is currently no support for getting more than 50 tagged
records, but /tag will probably mutate to work like /get.
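
<p>Building the path for a tag lookup might look like this in Python.
The helper and its <code>use_h_alias</code> switch are hypothetical; the
switch picks the H alias for setups where an encoded # does not survive
the trip to the CGI.

```python
from urllib.parse import quote

def tag_path(kind, name, use_h_alias=False):
    """Build a /tag path for a hashtag ('#') or a ref tag ('@')."""
    if kind == "#":
        prefix = "H" if use_h_alias else quote("#", safe="")
    elif kind == "@":
        prefix = quote("@", safe="")
    else:
        raise ValueError("kind must be '#' or '@'")
    return "/tag/" + prefix + name
```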
 297
<h3><a name="api_subscribe">/subscribe/(user)</a> - Subscribe to a
user's updates</h3>
 300
 301 <p>POST to /subscribe/(user) with a <code>username</code> parameter and
 302 an auth cookie, where (user) is the user whose updates you wish to
 303 subscribe to.  The server will respond with JSON failure if the auth
 304 cookie is bad or if the user doesn't exist.  The server will respond
 305 with JSON success after the subscription is successfully registered.
 306
 307 <h3><a name="api_unsubscribe">/unsubscribe/(user)</a> - Unsubscribe from
 308 a user's updates</h3>
 309
 310 <p>Identical to /subscribe, but removes the subscription.
 311
 312 <h3><a name="api_feed">/feed</a> - Get updates for subscribed users</h3>
 313
 314 <p>POST to /feed, with a <code>username</code> parameter and an auth
 315 cookie.  The server will respond with a JSON list of the last 50 updates
 316 from all subscribed users, in reverse chronological order.  Fetching
 317 /feed resets the new message count returned from /feedinfo.
 318
 319 <p>NOTE: subscription notifications are only stored while subscriptions
 320 are active.  Any records inserted before or after a subscription is
 321 active will not show up in /feed.
 322
<h3><a name="api_feedinfo">/feedinfo, /feedinfo/(user)</a> - Get subscription
status for a user</h3>
 325
 326 <p>POST to /feedinfo with a <code>username</code> parameter and an auth
 327 cookie to get general information about your subscribed feeds.
 328 Currently, this only tells you how many new records there are since the
 329 last time /feed was fetched.  The server will respond with a JSON
 330 object:
 331
 332 <pre>
 333 {"new":3}
 334 </pre>
 335
 336 <p>POST to /feedinfo/(user) with a <code>username</code> parameter and
 337 an auth cookie, where (user) is a user whose subscription status you are
 338 interested in.  The server will respond with a simple JSON object:
 339
 340 <pre>
 341 {"subscribed":true}
 342 </pre>
 343
 344 <p>The value of "subscribed" will be either true or false depending on
 345 the subscription status.
 346
<h3><a name="api_passwd">/passwd</a> - Change a user's password</h3>
 348
 349 <p>POST to /passwd with a <code>username</code> parameter and an auth
 350 cookie, plus <code>password</code> and <code>new_password</code>
 351 parameters to change the user's password.  For extra protection,
 352 changing a password requires sending the user's current password in the
 353 <code>password</code> parameter.  If authentication is successful and
 354 the password matches, the user's password is set to
 355 <code>new_password</code> and the server responds with JSON success.
 356
<p>If the password doesn't match, or if either <code>password</code> or
<code>new_password</code> is missing, the server returns JSON failure.
 359
 360 <h2><a name="design">Design</a></h2>
 361
 362 <h3><a name="motivation">Motivation</a></h3>
 363
 364 <p>Blërg was created as the result of a thought experiment: "What if
 365 Twitter didn't need thousands of servers? What if its millions of users
 366 could be handled by a single highly efficient server?"  This is probably
 367 an unreachable goal due to the sheer amount of I/O, but we can certainly
 368 try to do better.  Blërg was thus designed as a system with very simple
 369 requirements:
 370
 371 <ol>
 372 <li>Store and fetch small chunks of text efficiently</li>
 373 <li>Create fast indexes for hash tags and @ mentions</li>
<li>Provide an HTTP interface web apps can use</li>
 375 </ol>
 376
 377 <p>And to further simplify, I didn't bother handling deletes, full text
 378 search, or more complicated tag searches.  Blërg only does the basics.
 379
 380 <h3><a name="web_app_stack">Web App Stack</a></h3>
 381
 382 <table class="pizzapie">
 383 <tr><th>Classical model</th></tr>
 384 <tr>
 385   <td style="background-color: blue; color: white"><b>Client App</b><br>HTML/Javascript</td>
 386 </tr>
 387 <tr>
 388   <td style="background-color: #9F0000; color: white"><b>Webserver</b><br>Apache, lighttpd, nginx, etc.</td>
 389 </tr>
 390 <tr>
 391   <td style="background-color: #009F00; color: white"><b>Server App</b><br>Python, Perl, Ruby, etc.</td>
 392 </tr>
 393 <tr>
 394   <td style="background-color: #404040; color: white"><b>Database</b><br>MySQL, PostgreSQL, MongoDB, CouchDB, etc.</td>
 395 </tr>
 396 </table>
 397
 398 <p>Modern web applications have at least a four-layer approach.  You
 399 have the client-side browser app, the web server, the server-side
 400 application, and the database.  Your data goes through a lot of layers
 401 before it actually resides on disk somewhere (or, as they're calling it
 402 these days, "The Cloud" *waves hands*).  Each of those layers requires
 403 some amount of computing resources, so to increase throughput, we must
 404 make the layers more efficient, or reduce the number of layers.
 405
 406 <table class="pizzapie">
 407 <tr><th>Blërg model</th></tr>
 408 <tr>
 409   <td style="background-color: blue; color: white"><b>Blërg Client App</b><br>HTML/Javascript</td>
 410 </tr>
 411 <tr>
 412   <td style="background-color: #404040; color: white"><b>Blërg Database</b><br>Fuckin' hardcore C and shit</td>
 413 </tr>
 414 </table>
 415
 416 <p>Blërg does both by smashing the last two or three layers into one
 417 application.  Blërg can be run as either a standalone web server
 418 (currently deprecated because maintaining two versions is hard), or as a
 419 CGI (FastCGI support is planned, but I just don't care right now).  Less
 420 waste, more throughput.  As a consequence of this, the entirety of the
 421 application logic that the user sees is implemented in the client app in
 422 Javascript.  That's why all the URLs have #'s &mdash; the page is loaded
 423 once and switched on the fly to show different views, further reducing
load on the server.  Even parsing hash tags and URLs is done in client
JS.
 426
 427 <p>The API is simple and pragmatic.  It's not entirely RESTful, but is
 428 rather designed to work well with web-based front-ends.  Client data is
 429 always POSTed with the usual application/x-www-form-urlencoded encoding,
 430 and server data is always returned in JSON format.
 431
 432 <p>The HTTP interface to the database idea has already been done by <a
 433 href="http://couchdb.apache.org/">CouchDB</a>, though I didn't know that
 434 until after I wrote Blërg. :)
 435
 436 <h3><a name="database">Database</a></h3>
 437
 438 <p>I was impressed by <a
 439 href="http://www.varnish-cache.org/">varnish</a>'s design, so I decided
 440 early in the design process that I'd try out mmaped I/O.  Each user in
Blërg has their own database, which consists of a metadata file, and one
 442 or more data and index files.  The data and index files are memory
 443 mapped, which hopefully makes things more efficient by letting the OS
 444 handle when to read from disk (or maybe not &mdash; I haven't
benchmarked it).  The index files are preallocated because I believe
it's more efficient than writing to them 40 bytes at a time as records
are added.  The database's limits are reasonable:
 448
 449 <table class="statistics">
 450 <tr><td>maximum record size</td><td>65535 bytes</td></tr>
 451 <tr><td>maximum number of records per database</td><td>2<sup>64</sup> - 1</td></tr>
 452 <tr><td>maximum number of tags per record</td><td>1024</td></tr>
</table>
 454
 455 <p>So as not to create grossly huge and unwieldy data files, the
 456 database layer splits data and index files into many "segments"
 457 containing at most 64K entries each.  Those of you doing some quick
 458 mental math may note that this could cause a problem on 32-bit machines
 459 &mdash; if a full segment contains entries of the maximum length, you'll
 460 have to mmap 4GB (32-bit Linux gives each process only 3GB of virtual
 461 address space).  Right now, 32-bit users should change
 462 <code>RECORDS_PER_SEGMENT</code> in <code>config.h</code> to something
 463 lower like 32768.  In the future, I might do something smart like not
 464 mmaping the whole fracking file.
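
<p>The mental math, spelled out as a small Python sketch.  It takes "64K"
to mean 65536 and "3GB" to mean 3 * 2<sup>30</sup> bytes, which are
assumptions about the figures above.

```python
MAX_RECORD_BYTES = 65535
THREE_GIB = 3 * 2**30  # assumed 32-bit Linux address space per process

def worst_case_data_bytes(records_per_segment):
    """Size of one data segment if every record has the maximum length."""
    return records_per_segment * MAX_RECORD_BYTES

default_segment = worst_case_data_bytes(65536)  # just under 4 GiB
halved_segment = worst_case_data_bytes(32768)   # just under 2 GiB
```

<p>The default worst case exceeds the 3GB limit, while the halved segment
size stays under it.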
 465
 466 <table class="bitstructure">
 467 <tr><th>Record Index Structure</th></tr>
 468 <tr><td class="B4">offset (32-bit integer)</td></tr>
 469 <tr><td class="B2">length (16-bit integer)</td></tr>
 470 <tr><td class="B2">flags (16-bit integer)</td></tr>
 471 <tr><td class="B4">timestamp (32-bit integer)</td></tr>
 472 </table>
 473
 474 <p>A record is stored by first appending the data to the data file, then
 475 writing an entry in the index file containing the offset and length of
 476 the data, as well as the timestamp.  Since each index entry is fixed
 477 length, we can find the index entry simply by multiplying the record
 478 number we want by the size of the index entry.  Upshot: constant-time
 479 random-access reads and constant-time writes.  As an added bonus,
 480 because we're using append-only files, we get lockless reads.
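
<p>Here is a toy model of that scheme in Python, using in-memory buffers
instead of mmapped files and ignoring segments.  It is not Blërg's code:
the little-endian, unpadded layout and the function names are assumptions
made for the sketch.  The fields in the table above add up to 12 bytes.

```python
import struct

# offset (32-bit), length (16-bit), flags (16-bit), timestamp (32-bit).
# "<" means little-endian with no padding: an assumption for this sketch.
INDEX_ENTRY = struct.Struct("<IHHI")

def append_record(index, data, payload, timestamp):
    """Append the payload to the data buffer, then write its index entry."""
    offset = len(data)
    data.extend(payload)  # 1. data first
    index.extend(INDEX_ENTRY.pack(offset, len(payload), 0, timestamp))  # 2. index
    return len(index) // INDEX_ENTRY.size - 1  # the new record number

def read_record(index, data, record_number):
    """Constant-time lookup: record number times entry size gives the entry."""
    offset, length, _flags, timestamp = INDEX_ENTRY.unpack_from(
        index, record_number * INDEX_ENTRY.size
    )
    return timestamp, bytes(data[offset:offset + length])

index, data = bytearray(), bytearray()
append_record(index, data, b"eatin a taco on fifth street", 1294309438)
append_record(index, data, b"I'm taking #garfield to the vet.", 1294555793)
```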
 481
 482 <table class="bitstructure">
 483 <tr><th>Tag Structure</th></tr>
 484 <tr><td class="B32">username (32 bytes)</td></tr>
 485 <tr><td class="B8">record number (64-bit integer)</td></tr>
 486 </table>
 487
 488 <p>Tags are handled by a separate set of indices, one per tag.  When a
 489 record is added, it is scanned for tags, then entries are appended to
 490 each tag index for the tags found.  Each index record simply stores the
 491 user and record number.  Tags are searched by opening the tag file,
 492 reading the last 50 entries or so, and then reading all the records
 493 listed.  Voila, fast tag lookups.
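
<p>Reading the tail of a tag index could be sketched like this in Python.
Again, this is an illustration rather than Blërg's code: the
little-endian layout, the null padding of usernames, and the function
name are all assumptions.

```python
import struct

# 32-byte username followed by a 64-bit record number: 40 bytes per entry.
# Byte order and null padding are assumptions for this sketch.
TAG_ENTRY = struct.Struct("<32sQ")

def last_tag_entries(tag_index, count=50):
    """Return the newest `count` (username, record number) pairs, oldest first."""
    total = len(tag_index) // TAG_ENTRY.size
    entries = []
    for i in range(max(0, total - count), total):
        raw_name, record = TAG_ENTRY.unpack_from(tag_index, i * TAG_ENTRY.size)
        entries.append((raw_name.rstrip(b"\0").decode("ascii"), record))
    return entries

# Build a fake index with 60 entries, then read back the last 50.
tag_index = bytearray()
for n in range(60):
    tag_index.extend(TAG_ENTRY.pack(b"Jon", n))
recent = last_tag_entries(tag_index)
```

<p>Each pair then identifies one record to fetch from that user's own
database.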
 494
 495 <p>At this point, you're probably thinking, "Is that it?"  Yep, that's
 496 it.  Blërg isn't revolutionary, it's just a system whose requirements
 497 were pared down until the implementation could be made dead simple.
 498
 499 <p>Also, keeping with the style of modern object databases, I haven't
 500 implemented any data safety (har har).  Blërg does not sync anything to
 501 disk before returning success.  This should make Blërg extremely fast,
 502 and totally unreliable in a crash.  But that's the way you want it,
 503 right? :]
 504
 505 <h3><a name="subscriptions">Subscriptions</a></h3>
 506
 507 <p>When I first started thinking about the idea of subscriptions, I
 508 immediately came up with the naïve solution: keep a list of users to
 509 which users are subscribed, then when you want to get updates, iterate
 510 over the list and find the last entries for each user.  And that would
 511 work, but it's kind of costly in terms of disk I/O.  I have to visit
 512 each user in the list, retrieve their last few entries, and store them
 513 somewhere else to be sorted later.  And worse, that computation has to
 514 be done every time a user checks their feed. As the number of users and
 515 subscriptions grows, that will become a problem.
 516
 517 <p>So instead, I thought about it the other way around. Instead of doing
 518 all the work when the request is received, Blërg tries to do as much as
 519 possible by "pushing" updates to subscribed users.  You can think of it
 520 kind of like a mail system.  When a user posts new content, a
 521 notification is "sent" out to each of that user's subscribers.  Later,
 522 when the subscribers want to see what's new, they simply check their
 523 mailbox.  Checking your mailbox is usually a lot more efficient than
 524 going around and checking everyone's records yourself, even with the
 525 overhead of the "mailman."
 526
 527 <p>The "mailbox" is a subscription index, which is identical to a tag
 528 index, but is a per-user construct.  When a user posts a new record, a
 529 subscription index record is written for every subscriber.  It's a
 530 similar amount of I/O as the naïve version above, but the important
 531 difference is that it's only done once.  Retrieving records for accounts
 532 you're subscribed to is then as simple as reading your subscription
 533 index and reading the associated records.  This is hopefully less I/O
 534 than the naïve version, since you're reading, at most, as many accounts
 535 as you have records in the last N entries of your subscription index,
 536 instead of all of them.  And as an added bonus, since subscription index
 537 records are added as posts are created, the subscription index is
 538 automatically sorted by time!  To support this "mail" architecture, we
 539 also keep a list of subscribers and subscrib...ees in each account.
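
<p>The "mail" model fits in a few lines of Python.  This is an in-memory
toy, not Blërg's on-disk format, and the class and method names are
invented for the example.

```python
from collections import defaultdict

class FeedSketch:
    def __init__(self):
        self.subscribers = defaultdict(set)   # author -> readers
        self.mailbox = defaultdict(list)      # reader -> [(author, record)]
        self.record_count = defaultdict(int)  # author -> records so far

    def subscribe(self, reader, author):
        self.subscribers[author].add(reader)

    def unsubscribe(self, reader, author):
        self.subscribers[author].discard(reader)

    def post(self, author):
        """Add a record and push a notification to every current subscriber."""
        record = self.record_count[author]
        self.record_count[author] += 1
        for reader in self.subscribers[author]:
            self.mailbox[reader].append((author, record))
        return record

    def feed(self, reader, count=50):
        """Newest notifications first; the mailbox is already in time order."""
        return self.mailbox[reader][-count:][::-1]

feeds = FeedSketch()
feeds.post("Jon")               # record 0: nobody is subscribed yet
feeds.subscribe("Liz", "Jon")
feeds.post("Jon")               # record 1: delivered to Liz
feeds.subscribe("Liz", "Odie")
feeds.post("Odie")              # Odie's record 0: delivered to Liz
```

<p>Jon's first record never reaches Liz's mailbox, because the work
happens at post time and she was not subscribed yet.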
 540
 541 <h3><a name="problems">Problems, Caveats, and Future Work</a></h3>
 542
 543 <p>Blërg probably doesn't actually work like Twitter because I've never
 544 actually had a Twitter account.
 545
 546 <p>I couldn't find a really good fast HTTP server library.
 547 Libmicrohttpd is small, but it's focused on embedded applications, so it
 548 often eschews speed for small memory footprint.  This is especially
 549 apparent when you watch it chew through a POST request 300 bytes at a
 550 time even though you've specified a buffer size of 256K.
 551 <code>blerg.httpd</code> is still pretty fast this way &mdash; on my
 552 2GHz Opteron 246, <a
 553 href="http://www.joedog.org/index/siege-home">siege</a> says it serves a
 554 690-byte /get request at about 945 transactions per second, average
 555 response time 0.05 seconds, with 100 concurrent accesses &mdash; but a
 556 fast HTTP server implementation could knock this out of the park.
 557
 558 <p>Libmicrohttpd is also really difficult to work with.  If you look at
 559 the code, <code>http_blerg.c</code> is about 70% longer than
 560 <code>cgi_blerg.c</code> simply because of all the iterator hoops I had
 561 to jump through to process POST requests.  And if you can believe it, I
 562 wrote <code>http_blerg.c</code> first. If I'd done it the other way
 563 around, I probably would have given up on libmicrohttpd. :-/
 564
 565 <p>The data structures written to disk are dependent on the size and
 566 endianness of the primitive data types on your architecture and OS.
 567 This means that the databases are not portable.  A dump/import tool is
 568 probably the easiest way to handle this.
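
<p>To see why byte order matters, here is a small Python illustration
(not Blërg code) that writes a 64-bit record number in both byte orders
and then misreads one as the other.

```python
import struct

record_number = 57

little = struct.pack("<Q", record_number)  # as a little-endian machine writes it
big = struct.pack(">Q", record_number)     # as a big-endian machine writes it

# Reading big-endian bytes on a little-endian machine gives a very
# different number.
misread = struct.unpack("<Q", big)[0]
```

<p>A dump/import tool sidesteps this by going through a format with a
fixed representation.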
 569
 570 <p>I do want to make a FastCGI version eventually, and this will
 571 probably be a rather simple modification of cgi_blerg.
 572
 573 <p>Implementing deletes will be... interesting.  There is room in the
 574 record index for a 'deleted' flag, but the problem is deleting any tags
 575 referenced in the data.  This requires rescanning the record content and
 576 putting a 'deleted' flag in the tag indices.  This will not be pretty,
 577 so I'm just going to ignore it and hope nobody makes any mistakes. ;]
 578
 579 <p>Tag indices can grow arbitrarily large, which will cause problems for
 580 32-bit machines around the 3GB mark.  Still, that's something like 80
 581 million tags, so maybe it's not something to worry about.
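
<p>That estimate can be checked with a couple of lines of Python,
assuming 40-byte tag entries (a 32-byte username plus a 64-bit record
number) and reading "3GB" as 3 * 2<sup>30</sup> bytes.

```python
TAG_ENTRY_BYTES = 32 + 8           # username + 64-bit record number
ADDRESS_SPACE_BYTES = 3 * 2**30    # assumed 32-bit limit

entries_that_fit = ADDRESS_SPACE_BYTES // TAG_ENTRY_BYTES
```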
 582
 583 <p>The API currently requires the client to transmit the user's password
 584 in the clear.  A digest-based authentication scheme would be better,
 585 though for real security, the app should run over HTTPS.
 586
 587 </body>
 588 </html>