4 <title>Blërg Documentation</title>
5 <link rel="stylesheet" href="/css/doc.css">
11 Blërg is a minimalistic tagged text document database engine that also
12 pretends to be a <a href="/">microblogging system</a>. It is designed
to efficiently store small (&lt; 64K) pieces of text in a way that they
14 can be quickly retrieved by record number or by querying for tags
15 embedded in the text. Its native interface is HTTP — Blërg comes
16 as either a standalone HTTP server, or a CGI. Blërg is written in pure
20 <li><a href="#installing">Installing</a>
22 <li><a href="#getting_the_source">Getting the source</a></li>
23 <li><a href="#requirements">Requirements</a></li>
24 <li><a href="#configuring">Configuring</a></li>
25 <li><a href="#building">Building</a></li>
26 <li><a href="#installing">Installing</a></li>
29 <li><a href="#api">API</a>
31 <li><a href="#api_definitions">API Definitions</a></li>
32 <li><a href="#api_create">/create - create a new user</a></li>
33 <li><a href="#api_login">/login - log in</a></li>
34 <li><a href="#api_logout">/logout - log out</a></li>
35 <li><a href="#api_put">/put - add a new record</a></li>
36 <li><a href="#api_get">/get/(user), /get/(user)/(start record)-(end record) - get records for a user</a></li>
37 <li><a href="#api_info">/info/(user) - Get information about a user</a></li>
38 <li><a href="#api_tag">/tag/(#|H|@)(tagname) - Retrieve records containing tags</a></li>
41 <li><a href="#design">Design</a>
43 <li><a href="#motivation">Motivation</a></li>
44 <li><a href="#web_app_stack">Web App Stack</a></li>
45 <li><a href="#database">Database</a></li>
<li><a href="#problems">Problems, Caveats, and Future Work</a></li>
51 <h2><a name="installing">Installing</a></h2>
53 <h3><a name="getting_the_source">Getting the source</a></h3>
55 <p>There's no stable release yet, but you can get everything currently
56 running on blerg.dominionofawesome.com by cloning the git repository at
57 http://git.bytex64.net/blerg.git.
59 <h3><a name="requirements">Requirements</a></h3>
61 <p>Blërg has varying requirements depending on how you want to run it
62 — as a standalone HTTP server, or as a CGI. You will need:
65 <li><a href="http://lloyd.github.com/yajl/">yajl</a> >= 1.0.0
66 (yajl is a JSON parser/generator written in C which, by some twisted
67 sense of humor, requires ruby to compile)</li>
<p>As a standalone HTTP server, you will also need:
73 <li><a href="http://www.gnu.org/software/libmicrohttpd/">GNU libmicrohttpd</a> >= 0.9.3</li>
76 <p>Or, as a CGI, you will need:
79 <li><a href="http://www.newbreedsoftware.com/cgi-util/download/">cgi-util</a> >= 2.2.1</li>
82 <h3><a name="configuring">Configuring</a></h3>
84 <p>I know I'm gonna get shit for not using an autoconf-based system, but
85 I really didn't want to spend time figuring it out. You should edit
86 libs.mk and put in the paths where you can find headers and libraries
87 for the above requirements.
89 <p>Also, further apologies to BSD folks — I've probably committed
90 several unconscious Linux-isms. It would not surprise me if the
91 makefile refuses to work with BSD make, or if it fails to compile even
92 with gmake. If you have patches or suggestions on how to make Blërg
93 more portable, I'd be happy to hear them.
95 <h3><a name="building">Building</a></h3>
97 <p>At this point, it should be gravy. Type 'make' and in a few seconds,
98 you should have <code>blerg.httpd</code>, <code>blerg.cgi</code>,
99 <code>rss.cgi</code>, and <code>blergtool</code>. Each of those can be
100 made individually as well, if you, for example, don't want to install
101 the prerequisites for <code>blerg.httpd</code> or
102 <code>blerg.cgi</code>.
104 <h3><a name="installing">Installing</a></h3>
106 <p>While it's not strictly required, Blërg will be easier to set up if
107 you configure it to work from the root of your website. For this
reason, it's better to use a subdomain (e.g., blerg.yoursite.com is
109 easier than yoursite.com/blerg/). If you do want to put it in a
110 subdirectory, you will have to modify <code>www/js/blerg.js</code> and
111 change baseURL at the top as well as a number of other self-references
112 in that file and <code>www/index.html</code>. The CGI version should
113 work fine this way, but the HTTP version will require the request to be
114 rewritten, as it expects to be serving from the root.
116 <p>You cannot serve the database and client from different domains
(e.g., yoursite.com vs. othersite.net, or even foo.yoursite.com and
118 bar.yoursite.com). This is a requirement of the web browser — the
119 same origin policy will not allow an AJAX request to travel across
122 <h4>For the standalone web server:</h4>
124 <p>Right now, <code>blerg.httpd</code> doesn't serve any static assets,
125 so you're going to have to put it behind a real webserver like apache,
126 lighttpd, nginx, or similar. Set the document root to the www
127 directory, then proxy /info, /create, /login, /logout, /get, /tag, and
128 /put to blerg.httpd. You can change the port <code>blerg.httpd</code>
129 listens on in <code>config.h</code>.
131 <h4>For the CGI version:</h4>
133 <p>Copy the files in www/ to the root of your web server. Copy
134 <code>blerg.cgi</code> to your web server. Included in www-configs/ is
135 a .htaccess file for Apache that will rewrite the URLs. If you need to
136 call the CGI something other than <code>blerg.cgi</code>, the .htaccess
137 file will need to be modified.
139 <h4>The extra RSS CGI</h4>
<p>There is an optional RSS CGI (<code>rss.cgi</code>) that will serve
142 RSS feeds for users. Install this like <code>blerg.cgi</code> above.
145 <h2><a name="api">API</a></h2>
147 <p>Blërg's API was designed to be as simple as possible. Data sent from
148 the client is POSTed with the application/x-www-form-urlencoded
149 encoding, and a successful response is always JSON. The API endpoints
150 will be described as though the server were serving requests from the
153 <h3><a name="api_definitions">API Definitions</a></h3>
155 <p>On failure, all API calls return either a standard HTTP error
156 response, like 404 Not Found if a record or user doesn't exist, or a 200
157 response with a 'JSON failure', which will look like this:
159 <p><code>{"status": "failure"}</code>
161 <p>Blërg doesn't currently explain <i>why</i> there is a failure, and
162 I'm not sure it ever will.
164 <p>On success, you'll either get some JSON relating to your request (for
165 /get, /tag, or /info), or a 'JSON success' response (for /create, /put,
166 /login, or /logout), which looks like this:
168 <p><code>{"status": "success"}</code>
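<p>As a rough illustration of the rules above, a client might sort replies
into four buckets. This is only a sketch: the function name
<code>classify_response</code> and the bucket names are made up for this
example and are not part of Blërg.

```python
import json


def classify_response(http_status, body):
    """Sort a reply into 'success', 'failure', 'data', or 'http_error'."""
    # Anything other than 200 is a plain HTTP error (404, 500, 502, ...).
    if http_status != 200:
        return "http_error"
    parsed = json.loads(body)
    # A 'JSON success' or 'JSON failure' is an object holding only "status".
    if isinstance(parsed, dict) and list(parsed) == ["status"]:
        return parsed["status"]
    # Everything else is the payload of /get, /tag, or /info.
    return "data"
```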
170 <p>For the CGI backend, you may get a 500 error if something goes wrong.
171 For the HTTP backend, you'll get nothing (since it will have crashed),
172 or maybe a 502 Bad Gateway if you have it behind another web server.
174 <p>All usernames must be 32 characters or less. Usernames must contain
175 only the ASCII characters 0-9, A-Z, a-z, underscore (_), period (.),
176 hyphen (-), single quote ('), and space ( ). Passwords can be at most
177 64 bytes, and have no limits on characters (but beware: if you have a
178 null in the middle, it will stop checking there because I use
179 <code>strncmp(3)</code> to compare).
181 <p>Tags must be 64 characters or less, and can contain only the ASCII
182 characters 0-9, A-Z, a-z, hyphen (-), and underscore (_).
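<p>A client can check these limits before bothering the server. The sketch
below is one way to do it in Python. The helper names are invented for this
example, the one-character minimum is an assumption (the rules above only
give maximums), and rejecting embedded nulls in passwords is a client-side
precaution, not something the server enforces.

```python
import re

# Character sets and maximum lengths as listed above; the minimum length
# of 1 is an assumption made for this sketch.
USERNAME_RE = re.compile(r"[0-9A-Za-z_.\-' ]{1,32}")
TAG_RE = re.compile(r"[0-9A-Za-z_-]{1,64}")


def valid_username(name):
    return USERNAME_RE.fullmatch(name) is not None


def valid_tag(tag):
    return TAG_RE.fullmatch(tag) is not None


def valid_password(password):
    raw = password.encode("utf-8")
    # The limit is in bytes, not characters. A null byte would cut the
    # comparison short, so this sketch refuses it outright.
    return len(raw) <= 64 and b"\0" not in raw
```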
<h3><a name="api_create">/create</a> - create a new user</h3>
186 <p>To create a user, POST to /create with <code>username</code> and
187 <code>password</code> parameters for the new user. The server will
188 respond with JSON failure if the user exists, or if the user can't be
189 created for some other reason. The server will respond with JSON
190 success if the user is created.
<h3><a name="api_login">/login</a> - log in</h3>
194 <p>POST to /login with the <code>username</code> and
195 <code>password</code> parameters for an existing user. The server will
196 respond with JSON failure if the user does not exist or if the password
197 is incorrect. On success, the server will respond with JSON success,
198 and will set a cookie named 'auth' that must be sent by the client when
199 accessing restricted API functions (/put and /logout).
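<p>For illustration, here is how a /login request could be put together
with Python's standard library. Nothing is sent over the network. The host
<code>blerg.example.com</code>, the helper <code>build_post</code>, and the
credentials are placeholders invented for this sketch.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://blerg.example.com"  # hypothetical host


def build_post(path, **params):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


login = build_post("/login", username="taco fan", password="s3cret&more")
# Passing `login` to an opener built with
# urllib.request.HTTPCookieProcessor would send it and keep the 'auth'
# cookie around for later /put and /logout calls.
```

<p>The same helper would work for /create, since it takes the same two
parameters.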
<h3><a name="api_logout">/logout</a> - log out</h3>
<p>POST to /logout with <code>username</code>, the user to log out,
204 along with the auth cookie in a Cookie header. The server will respond
205 with JSON failure if the user does not exist or if the auth cookie is
206 bad. The server will respond with JSON success after the user is
207 successfully logged out.
<h3><a name="api_put">/put</a> - add a new record</h3>
211 <p>POST to /put with <code>username</code> and <code>data</code>
212 parameters, and an auth cookie. The server will respond with JSON
213 failure if the auth cookie is bad, if the user doesn't exist, or if
214 <code>data</code> contains more than 65535 bytes <i>after</i> URL
215 decoding. The server will respond with JSON success after the record is
<h3><a name="api_get">/get/(user), /get/(user)/(start record)-(end record)</a> - get records for a user</h3>
220 <p>A GET request to /get/(user), where (user) is the user desired, will
221 return the last 50 records for that user in a list of objects. The
222 record objects look like this:
227 "timestamp":1294309438,
228 "data":"eatin a taco on fifth street"
232 <p><code>record</code> is the record number, <code>timestamp</code> is
233 the UNIX epoch timestamp (i.e., the number of seconds since Jan 1 1970
234 00:00:00 GMT), and <code>data</code> is the content of the record. The
235 record number is sent as a string because while Blërg supports record
236 numbers up to 2<sup>64</sup> - 1, Javascript uses floating point for all
237 its numbers, and can only support integers without truncation up to
238 2<sup>53</sup>. This difference is largely academic, but I didn't want
239 this problem to sneak up on anyone who is more insane than I am. :]
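<p>The truncation is easy to see outside a browser too, since Python's
<code>float</code> is the same IEEE 754 double that JavaScript uses for
numbers. The record object in this sketch is made up, but it follows the
field names described above.

```python
import json

LIMIT = 2 ** 53

# 2**53 itself survives a round trip through a double...
assert int(float(LIMIT)) == LIMIT
# ...but 2**53 + 1 does not: the nearest double is 2**53 again.
assert int(float(LIMIT + 1)) == LIMIT

# Because the record number arrives as a string, no double is involved
# and the full 64-bit value comes through intact.
record = json.loads(
    '{"record":"18446744073709551615","timestamp":1294309438,"data":"x"}'
)
record_number = int(record["record"])
```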
241 <p>The second form, /get/(user)/(start record)-(end record), retrieves a
242 specific range of records, from (start record) to (end record)
243 inclusive. You can retrieve at most 100 records this way. If (end
244 record) - (start record) specifies more than 100 records, or if the
245 range specifies invalid records, or if the end record is before the
246 start record, the server will respond with JSON failure.
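<p>A client can check a range before asking for it. The sketch below
assumes that "at most 100 records" means an inclusive count, i.e. (end
record) - (start record) + 1 must not exceed 100, and that record numbers
are non-negative. Both are guesses from the wording above, and
<code>get_range_path</code> is a name invented for this example.

```python
import urllib.parse

MAX_RANGE = 100  # most records one ranged /get may ask for


def get_range_path(user, start, end):
    """Return the path for a ranged /get, or raise ValueError."""
    if start < 0 or end < start:
        raise ValueError("end record must not come before start record")
    # Inclusive count; this reading of the limit is an assumption.
    if end - start + 1 > MAX_RANGE:
        raise ValueError("range covers more than 100 records")
    # Usernames may contain spaces and quotes, so percent-encode them.
    return "/get/%s/%d-%d" % (urllib.parse.quote(user, safe=""), start, end)
```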
<h3><a name="api_info">/info/(user)</a> - Get information about a user</h3>
250 <p>A GET request to /info/(user) will return a JSON object with
251 information about the user (currently only the number of records). The
252 info object looks like this:
256 "record_count": "544"
260 <p>Again, the record count is sent as a string for 64-bit safety.
<h3><a name="api_tag">/tag/(#|H|@)(tagname)</a> - Retrieve records containing tags</h3>
264 <p>A GET request to this endpoint will return the last 50 records
265 associated with the given tag. The first character is either # or H for
266 hashtags, or @ for mentions (I call them ref tags). You should URL
267 encode the # or @, lest some servers complain at you. The H alias for #
268 was created because Apache helpfully strips the fragment of a URL
269 (everything from the # to the end) before handing it off to the CGI,
270 even if the hash is URL encoded. The record objects also contain an
271 extra <code>author</code> field, like so:
277 "timestamp":1294555793,
278 "data":"I'm taking #garfield to the vet."
<p>There is currently no support for getting more than the last 50 records
for a tag, but /tag will probably mutate to work like /get.
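<p>Building the request path might look like this. It is a sketch only:
<code>tag_path</code> and its <code>use_alias</code> switch are invented
for this example.

```python
from urllib.parse import quote


def tag_path(kind, name, use_alias=False):
    """Build the /tag path for a hashtag ('#') or a ref tag ('@')."""
    if kind not in ("#", "@"):
        raise ValueError("kind must be '#' or '@'")
    if kind == "#" and use_alias:
        # The H alias sidesteps servers that drop everything from the hash on.
        return "/tag/H" + name
    # Percent-encode the leading character: '#' -> %23, '@' -> %40.
    return "/tag/" + quote(kind, safe="") + name
```

<p>For the CGI version behind Apache, the H form is the safer choice for
hashtags, for the reason given above.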
285 <h2><a name="design">Design</a></h2>
287 <h3><a name="motivation">Motivation</a></h3>
289 <p>Blërg was created as the result of a thought experiment: "What if
290 Twitter didn't need thousands of servers? What if its millions of users
291 could be handled by a single highly efficient server?" This is probably
292 an unreachable goal due to the sheer amount of I/O, but we can certainly
293 try to do better. Blërg was thus designed as a system with very simple
297 <li>Store and fetch small chunks of text efficiently</li>
298 <li>Create fast indexes for hash tags and @ mentions</li>
<li>Provide an HTTP interface web apps can use</li>
302 <p>And to further simplify, I didn't bother handling deletes, full text
303 search, or more complicated tag searches. Blërg only does the basics.
305 <h3><a name="web_app_stack">Web App Stack</a></h3>
307 <table class="pizzapie">
308 <tr><th>Classical model</th></tr>
310 <td style="background-color: blue; color: white"><b>Client App</b><br>HTML/Javascript</td>
313 <td style="background-color: #9F0000; color: white"><b>Webserver</b><br>Apache, lighttpd, nginx, etc.</td>
316 <td style="background-color: #009F00; color: white"><b>Server App</b><br>Python, Perl, Ruby, etc.</td>
319 <td style="background-color: #404040; color: white"><b>Database</b><br>MySQL, PostgreSQL, MongoDB, CouchDB, etc.</td>
323 <p>Modern web applications have at least a four-layer approach. You
324 have the client-side browser app, the web server, the server-side
325 application, and the database. Your data goes through a lot of layers
326 before it actually resides on disk somewhere (or, as they're calling it
327 these days, "The Cloud" *waves hands*). Each of those layers requires
328 some amount of computing resources, so to increase throughput, we must
329 make the layers more efficient, or reduce the number of layers.
331 <table class="pizzapie">
332 <tr><th>Blërg model</th></tr>
334 <td style="background-color: blue; color: white"><b>Blërg Client App</b><br>HTML/Javascript</td>
337 <td style="background-color: #404040; color: white"><b>Blërg Database</b><br>Fuckin' hardcore C and shit</td>
341 <p>Blërg does both by smashing the last two or three layers into one
342 application. Blërg can be run as either a standalone web server, or as
343 a CGI (FastCGI support is planned, but I just don't care right now).
344 Less waste, more throughput. As a consequence of this, the entirety of
345 the application logic that the user sees is implemented in the client
346 app in Javascript. That's why all the URLs have #'s — the page is
347 loaded once and switched on the fly to show different views, further
reducing load on the server. Even parsing hash tags and URLs is done
351 <p>The API is simple and pragmatic. It's not entirely RESTful, but is
352 rather designed to work well with web-based front-ends. Client data is
353 always POSTed with the usual application/x-www-form-urlencoded encoding,
354 and server data is always returned in JSON format.
356 <p>The HTTP interface to the database idea has already been done by <a
357 href="http://couchdb.apache.org/">CouchDB</a>, though I didn't know that
358 until after I wrote Blërg. :)
360 <h3><a name="database">Database</a></h3>
362 <p>I was impressed by <a
363 href="http://www.varnish-cache.org/">varnish</a>'s design, so I decided
364 early in the design process that I'd try out mmaped I/O. Each user in
365 Blërg has their own database, which consists of one or more data and
366 index files, and a metadata file. When a database is opened, only the
367 metadata is actually read (currently a single 64-bit integer keeping
368 track of the last record id). The data and index files are memory
369 mapped, which hopefully makes things more efficient by letting the OS
370 handle when to read from disk. The index files are preallocated because
371 I believe it's more efficient than writing to it 40 bytes at a time as
372 records are added. The database's limits are reasonable:
374 <table class="statistics">
375 <tr><td>maximum record size</td><td>65535 bytes</td></tr>
<tr><td>maximum number of records per database</td><td>2<sup>64</sup> - 1</td></tr>
377 <tr><td>maximum number of tags per record</td><td>1024</td></tr>
380 <p>So as not to create grossly huge and unwieldy data files, the
381 database layer splits data and index files into many "segments"
382 containing at most 64K entries each. Those of you doing some quick math
383 in your heads may note that this could cause a problem on 32-bit
384 machines — if a full segment contains entries of the maximum
385 length, you'll have to mmap 4GB (32-bit Linux gives each process only
386 3GB of virtual address space). Right now, 32-bit users should change
387 <code>RECORDS_PER_SEGMENT</code> in <code>config.h</code> to something
388 lower like 32768. In the future, I might do something smart like not
389 mmaping the whole fracking file.
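<p>For anyone who would rather not do the math in their head, here is the
worst case worked out. It counts only the data file of one segment and
assumes every record is the maximum 65535 bytes. The function name is
invented for this sketch.

```python
MAX_RECORD_BYTES = 65535
GIB = 2 ** 30


def worst_case_segment_bytes(records_per_segment):
    """Size of a data segment in which every record is maximum length."""
    return records_per_segment * MAX_RECORD_BYTES


full = worst_case_segment_bytes(65536)  # 4294901760 bytes, just under 4 GiB
half = worst_case_segment_bytes(32768)  # 2147450880 bytes, just under 2 GiB
```

<p>The first figure is well past the 3GB a 32-bit Linux process can
address, while the second fits.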
391 <table class="bitstructure">
392 <tr><th>Record Index Structure</th></tr>
393 <tr><td class="B4">offset (32-bit integer)</td></tr>
394 <tr><td class="B2">length (16-bit integer)</td></tr>
395 <tr><td class="B2">flags (16-bit integer)</td></tr>
396 <tr><td class="B4">timestamp (32-bit integer)</td></tr>
399 <p>A record is stored by first appending the data to the data file, then
400 writing an entry in the index file containing the offset and length of
401 the data, as well as the timestamp. Since each index entry is fixed
402 length, we can find the index entry simply by multiplying the record
403 number we want by the size of the index entry. Upshot: constant-time
404 random-access reads and constant-time writes. As an added bonus,
405 because we're using append-only files, we get lockless reads.
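<p>The scheme is small enough to model in a few lines of Python. This is a
toy, not Blërg's C code: it uses in-memory <code>bytearray</code>s in place
of mmapped files, ignores segments, and assumes the four fields from the
table above are packed back to back with no padding (12 bytes per entry).
The function names are invented for this sketch.

```python
import struct

# offset (32-bit), length (16-bit), flags (16-bit), timestamp (32-bit),
# native byte order, no padding: 12 bytes per entry in this sketch.
INDEX_ENTRY = struct.Struct("=IHHI")


def append_record(data_file, index_file, payload, timestamp):
    """Append payload to the data, then its entry to the index."""
    if len(payload) > 65535:
        raise ValueError("record too large")
    offset = len(data_file)
    data_file.extend(payload)
    index_file.extend(INDEX_ENTRY.pack(offset, len(payload), 0, timestamp))
    # The new record number is the position of the entry just written.
    return len(index_file) // INDEX_ENTRY.size - 1


def read_record(data_file, index_file, record_number):
    """Constant-time lookup: record number times entry size."""
    offset, length, _flags, timestamp = INDEX_ENTRY.unpack_from(
        index_file, record_number * INDEX_ENTRY.size
    )
    return bytes(data_file[offset:offset + length]), timestamp
```

<p>Neither function ever rewrites bytes that are already there, which is
the property the lockless reads depend on.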
407 <table class="bitstructure">
408 <tr><th>Tag Structure</th></tr>
409 <tr><td class="B32">username (32 bytes)</td></tr>
410 <tr><td class="B8">record number (64-bit integer)</td></tr>
413 <p>Tags are handled by a separate set of indices, one per tag. When a
414 record is added, it is scanned for tags, then entries are appended to
415 each tag index for the tags found. Each index record simply stores the
416 user and record number. Tags are searched by opening the tag file,
417 reading the last 50 entries or so, and then reading all the records
418 listed. Voila, fast tag lookups.
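<p>Here is the same idea as a Python toy. A dict of
<code>bytearray</code>s stands in for the tag index files, and each entry
is packed as in the table above (32 bytes of username plus a 64-bit record
number, 40 bytes in all). The scanning regex and the choice to index each
tag once per record are assumptions made for this sketch, and all the names
are invented.

```python
import re
import struct

# username (32 bytes, null padded) + record number (64-bit): 40 bytes.
TAG_ENTRY = struct.Struct("=32sQ")
# '#' for hashtags, '@' for ref tags, then the allowed tag characters.
TAG_PATTERN = re.compile(r"([#@])([0-9A-Za-z_-]{1,64})")


def scan_tags(text):
    """Return (sigil, name) pairs found in the record text."""
    return TAG_PATTERN.findall(text)


def index_record(tag_indices, username, record_number, text):
    """Append one entry to the index of every tag found in text."""
    for key in set(scan_tags(text)):
        entry = TAG_ENTRY.pack(username.encode("ascii"), record_number)
        tag_indices.setdefault(key, bytearray()).extend(entry)


def last_entries(tag_indices, sigil, name, count=50):
    """Read the newest `count` entries from the end of one tag index."""
    index = tag_indices.get((sigil, name), b"")
    total = len(index) // TAG_ENTRY.size
    found = []
    for i in range(max(0, total - count), total):
        raw_user, record = TAG_ENTRY.unpack_from(index, i * TAG_ENTRY.size)
        found.append((raw_user.rstrip(b"\0").decode("ascii"), record))
    return found
```

<p>A real lookup would then fetch each listed record from its owner's
database.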
420 <p>At this point, you're probably thinking, "Is that it?" Yep, that's
421 it. Blërg isn't revolutionary, it's just a system whose requirements
422 were pared down until the implementation could be made dead simple.
424 <p>Also, keeping with the style of modern object databases, I haven't
425 implemented any data safety (har har). Blërg does not sync anything to
426 disk before returning success. This should make Blërg extremely fast,
427 and totally unreliable in a crash. But that's the way you want it,
430 <h3><a name="problems">Problems, Caveats, and Future Work</a></h3>
432 <p>Blërg probably doesn't actually work like Twitter because I've never
433 actually had a Twitter account.
435 <p>I couldn't find a really good fast HTTP server library.
436 Libmicrohttpd is small, but it's focused on embedded applications, so it
437 often eschews speed for small memory footprint. This is especially
438 apparent when you watch it chew through a POST request 300 bytes at a
439 time even though you've specified a buffer size of 256K.
440 <code>blerg.httpd</code> is still pretty fast this way — on my
442 href="http://www.joedog.org/index/siege-home">siege</a> says it serves a
443 690-byte /get request at about 945 transactions per second, average
444 response time 0.05 seconds, with 100 concurrent accesses — but a
445 fast HTTP server implementation could knock this out of the park.
447 <p>Libmicrohttpd is also really difficult to work with. If you look at
448 the code, <code>http_blerg.c</code> is about 70% longer than
449 <code>cgi_blerg.c</code> simply because of all the iterator hoops I had
450 to jump through to process POST requests. And if you can believe it, I
451 wrote <code>http_blerg.c</code> first. If I'd done it the other way
452 around, I probably would have given up on libmicrohttpd. :-/
454 <p>The data structures written to disk are dependent on the size and
455 endianness of the primitive data types on your architecture and OS.
456 This means that the databases are not portable. A dump/import tool is
457 probably the easiest way to handle this.
459 <p>I do want to make a FastCGI version eventually, and this will
460 probably be a rather simple modification of cgi_blerg.
462 <p>Implementing deletes will be... interesting. There is room in the
463 record index for a 'deleted' flag, but the problem is deleting any tags
464 referenced in the data. This requires rescanning the record content and
465 putting a 'deleted' flag in the tag indices. This will not be pretty,
466 so I'm just going to ignore it and hope nobody makes any mistakes. ;]
468 <p>Tag indices can grow arbitrarily large, which will cause problems for
469 32-bit machines around the 3GB mark. Still, that's something like 80
470 million tags, so maybe it's not something to worry about.
472 <p>The API currently requires the client to transmit the user's password
473 in the clear. A digest-based authentication scheme would be better,
474 though for real security, the app should run over HTTPS.