Queueing systems manage and process queues of jobs. The presence of a queue implies that something cannot be done immediately and one has to wait until some resource is available. When you respond to an HTTP request you usually want to do it interactively, that is, you want to respond within some reasonable amount of time. What can prevent you from doing this? Various things: external data sources, long-running tasks. Your request might consume too much memory and the system might decide to swap out something, which is time-consuming.
From a web app’s point of view, external data sources are beyond its influence: there is no way you can make external data sources respond faster other than by making the external data sources themselves respond faster. For example, if you are unsatisfied with how long a MySQL database takes to respond to a certain query, fixing this requires changing something in the database. Fixing the query would produce a new query, which is a different story.
When there is a long-running task, however, and it cannot be redesigned to run interactively, Continue reading
A few months ago I shut down the pages of the nginx socketlog module and the nginx redislog module. This was due to the excessive volume of support they required. Some people, however, found these to be interesting pieces of technology, and I got several requests to release them.
Today I am releasing these modules under a free licence. Meet the released nginx socketlog module and nginx redislog module!
So how do you store a lot of data if it is already over your head? The simplest answer is: partition horizontally, in other words divide and conquer. In horizontal partitioning each data instance is stored on a single storage node, so you can increase your total capacity by adding more nodes.
For horizontal partitioning to work, data instances must not depend on each other. If this condition is satisfied, there is no need to visit multiple storage nodes in order to reconcile a single data instance.
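The placement rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the `node_for` helper and the `PartitionedStore` class are hypothetical, the "nodes" are in-memory dictionaries, and hashing the key modulo the node count is just one simple way to pick a node.

```python
import hashlib


def node_for(key, nodes):
    """Map a key to exactly one storage node using a stable hash."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


class PartitionedStore:
    """Hypothetical store: every data instance lives on a single node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # In-memory dictionaries stand in for real storage nodes.
        self.data = {name: {} for name in self.nodes}

    def put(self, key, value):
        self.data[node_for(key, self.nodes)][key] = value

    def get(self, key):
        # Only one node is visited, because instances are independent.
        return self.data[node_for(key, self.nodes)].get(key)


store = PartitionedStore(["node-a", "node-b", "node-c"])
store.put("user:42", {"name": "Alice"})
print(store.get("user:42"))
```

Note that with plain modulo placement, adding a node changes the mapping for most existing keys, so this sketch says nothing about how to rebalance data when capacity grows.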
Sounds easy, eh? But it is not as easy as it seems… Continue reading
A long time ago I disabled comments because I was getting a lot of spam. Apparently this was man-made spam, because it did get through the CAPTCHA. Today I installed Social and also added Twitter and Facebook accounts.
Please comment and follow!
Recently I was looking into making my NLP knowledge more solid and I found this book by reference: Foundations of Statistical Natural Language Processing. It’s a classic book and certainly it was a good read.
Now, the topics it discusses might sound quite theoretical, so let me translate them into a few examples of how each of them could be applied in your work.
I wanted to write an article about the secrets of scalability, but it appears that this subject is too complex for one article. Instead, let’s just dissect some scalability problems as we go.
When you think about scalability, it is important to distinguish two different types of problems: those that require reading much more often than updating and those that require reading as often as, or even less often than, updating. The first type of problem is called WORM (write once read many), the second is called RW (read-write). It turns out that they are fundamentally different, and here is why. Continue reading
Correct me if I’m wrong, but it seems that this paper proves the optimality of the “multi-armed bandit” approach to A/B testing. The latter was described in this post earlier this year.
For those who do not understand what it is about: A/B testing requires an investment in the form of sample size (usually equal to the number of unique users), which means time and money. The “multi-armed bandit” approach is about optimising this investment.
I wouldn’t say you’re ancient if you aren’t doing it already, but it’s interesting to see how abstract science creates new opportunities for business.
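To make the “multi-armed bandit” idea a bit more concrete, here is a minimal sketch of epsilon-greedy, one simple bandit strategy. It is not necessarily the strategy analysed in the paper or described in the linked post; the class name, the variant labels and the `epsilon` value are all assumptions made for illustration.

```python
import random


class EpsilonGreedy:
    """Hypothetical epsilon-greedy selector for A/B variants."""

    def __init__(self, variants, epsilon=0.1, rng=None):
        self.variants = list(variants)
        self.epsilon = epsilon            # share of traffic spent exploring
        self.rng = rng or random.Random()
        self.shows = {v: 0 for v in self.variants}
        self.wins = {v: 0 for v in self.variants}

    def rate(self, variant):
        """Observed conversion rate (0.0 if the variant was never shown)."""
        shows = self.shows[variant]
        return self.wins[variant] / shows if shows else 0.0

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.variants)   # explore: random variant
        return max(self.variants, key=self.rate)    # exploit: best so far

    def record(self, variant, converted):
        self.shows[variant] += 1
        if converted:
            self.wins[variant] += 1


bandit = EpsilonGreedy(["A", "B"], epsilon=0.1)
variant = bandit.choose()        # pick a variant for the next visitor
bandit.record(variant, True)     # report whether that visitor converted
```

Unlike a fixed 50/50 split, most visitors are sent to whichever variant currently looks best, so fewer users are spent on the weaker variant while evidence accumulates.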
In version 1.0.2 of the redislog module I added a feature that allows you to do conditional logging. What can you do with it? For example, log only unique visitors. E.g.:
access_redislog test "$server_name$redislog_yyyymmdd" ifnot=$uid_got;
$uid_got is empty whenever a visitor doesn’t have a UID cookie. Therefore, this directive effectively logs only hits that arrive without a UID cookie, which normally means one hit per new unique visitor. You can populate a list (one per virtual host and day or hour) with unique visitor records and count them with LLEN. To do that, use the LPUSH command instead of the default APPEND. Details can be found in the manual.
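The counting logic can be illustrated with a small Python simulation. Everything here is a stand-in: `FakeRedis` only mimics the LPUSH and LLEN list commands in memory, `log_hit` is a hypothetical helper that imitates the `ifnot=$uid_got` condition, and the `yyyymmdd` date format in the key is an assumption about what `$redislog_yyyymmdd` expands to.

```python
from collections import defaultdict


class FakeRedis:
    """Tiny in-memory stand-in for the two Redis list commands used here."""

    def __init__(self):
        self.lists = defaultdict(list)

    def lpush(self, key, value):
        # LPUSH prepends the value and returns the new list length.
        self.lists[key].insert(0, value)
        return len(self.lists[key])

    def llen(self, key):
        # LLEN returns 0 for a key that does not exist.
        return len(self.lists.get(key, []))


def log_hit(redis, server_name, yyyymmdd, uid_got, record):
    """Imitate `ifnot=$uid_got`: log only when the UID cookie is absent."""
    if uid_got:
        return False
    # Key mirrors "$server_name$redislog_yyyymmdd" from the directive.
    redis.lpush(server_name + yyyymmdd, record)
    return True


redis = FakeRedis()
log_hit(redis, "example.com", "20110901", "", "first visit")
log_hit(redis, "example.com", "20110901", "uid=ABC", "returning visitor")
print(redis.llen("example.com20110901"))
```

Only the hit without a cookie ends up in the list, so the list length for a given host and day approximates the number of new unique visitors.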
Somehow the problem of logging was not completely addressed in nginx. I managed to implement two solutions for remote logging: the nginx socketlog module and the nginx redislog module. Each of these modules maintains one permanent connection per worker to the logging peer (a BSD syslog server or a redis database server). Messages are buffered in a 200k buffer even when the logging peer is offline and are pushed to the logging peer as soon as possible.
If the logging connection is interrupted, these modules try to re-establish it periodically and, if successful, the buffered messages get flushed to the remote peer. That is, if the logging peer closes the connection gracefully, you can restart it without restarting nginx.
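The buffer-and-reconnect behaviour can be sketched roughly as follows. This is a simplified Python illustration, not the modules' actual code: the `BufferedLogger` class and the `connect` callable are hypothetical, and dropping messages once the buffer is full is an assumption, since the text above does not say what happens on overflow.

```python
class BufferedLogger:
    """Hypothetical logger that buffers messages while the peer is offline."""

    def __init__(self, connect, limit=200 * 1024):
        # connect() returns a send(bytes) callable or raises OSError.
        self.connect = connect
        self.limit = limit              # buffer size, 200k as in the text
        self.buffer = bytearray()
        self.send = None                # None means "not connected"

    def log(self, message):
        if len(self.buffer) + len(message) > self.limit:
            return False                # assumption: drop when buffer is full
        self.buffer += message
        self.flush()
        return True

    def flush(self):
        """Try to (re)connect and push buffered data; safe to call on a timer."""
        if self.send is None:
            try:
                self.send = self.connect()
            except OSError:
                return                  # peer still offline, keep buffering
        if not self.buffer:
            return
        try:
            self.send(bytes(self.buffer))
            self.buffer.clear()
        except OSError:
            self.send = None            # connection lost, retry on next flush
```

Calling `flush()` periodically plays the role of the reconnection attempts: as soon as the peer is reachable again, everything still held in the buffer is sent in one go.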
In addition to that, the redis logging module is able to generate destination key names dynamically, so you can do some interesting tricks with it, e.g. having one log file per server or per IP per day.
Take a look at the manual in order to get the idea of how it works.
In one of the previous articles I discussed the basics of HTTP modules. As the power of Nginx comes from configuration files, you definitely need to know how to configure your module and make it ready for the variety of environments that users of your module may have. Here is how you can define configuration directives. Continue reading