Sunday, December 2, 2012

The power of column stores

  • using column stores instead of row-based stores can shrink 10 GB of access logs to 130 MB on disk
  • reading compressed log files from hard disk is 4 times faster than reading them uncompressed
  • column stores can speed up analytical queries by a factor of 18-58

Normally, log files from a web server are stored in a single file. For archiving, they are compressed with gzip. A typical line in a log file represents one request and looks like this:

173.15.3.XXX - - [30/May/2012:00:37:35 +0200] "GET /cms/ext/files/Sgs01Thumbs/sgs_pmwiki2.jpg HTTP/1.1" 200 14241 "http://www.simple-groupware.de/cms/ManualPrint" "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0"
Compression speeds up reading the log file:
$start = microtime(true);
$fp = gzopen("httpd.log.gz", "r");
while (!gzeof($fp)) gzread($fp, 8192);
gzclose($fp);
echo (microtime(true)-$start)."s\n"; // 26s

$start = microtime(true);
$fp = fopen("httpd.log", "r");
while (!feof($fp)) fread($fp, 8192);
fclose($fp);
echo (microtime(true)-$start)."s\n"; // 105s
(PHP 5.4.5, 2.5 GHz CPU, 7200 rpm hard disk)

Compressing the 10 GB log file with gzip yields a 600 MB file. This is already quite good, but can we do better?

In the example line, we have different attributes (=columns) separated by " " and []. For example:

ip=173.15.3.XXX
date=30/May/2012:00:37:35 +0200
status=200
length=14241
url=http://www.simple-groupware.de/cms/ManualPrint
agent=Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
etc.

When we save each column in a separate file, the files compress better. A file containing only the HTTP status code column, for example, holds many identical values (mostly 200), which gzip compresses very well.

Coming from 600 MB, we can reduce the total size to 280 MB by saving each column to a different file.
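The effect is easy to check on the shell. Here is a toy comparison (the file names, line format and line count are made up for illustration): 100,000 varied full log lines against a file holding only their status code column.

```shell
# generate 100,000 varied full log lines
seq 100000 | awk '{printf "173.15.3.%d - - [30/May/2012:00:37:%02d +0200] \"GET /page%d.html HTTP/1.1\" 200 %d\n", $1%256, $1%60, $1, $1*7}' > full_sample.log
# the same number of lines, but only the status code column
yes 200 | head -n 100000 > status_sample.log

# compress both and compare sizes: the repetitive column shrinks to a tiny fraction
gzip -c full_sample.log   > full_sample.log.gz
gzip -c status_sample.log > status_sample.log.gz
wc -c full_sample.log.gz status_sample.log.gz
```

The column file compresses far better because gzip sees one long run of identical values instead of values interleaved with IPs, dates and URLs.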

Analyzing files containing only one column is also much easier. For example, we can do a group by over all status codes on the shell with:

time zcat status.log.gz | sort | uniq -c
36689940 200
  11880 206
 124560 301
1142820 302
 968040 304
   3600 401
  37080 403
 784260 404
    180 405
   6480 408

real    0m31.600s
user    0m27.038s
sys     0m1.436s
Note: instead of sorting, it is faster to fill a small hash table with the counts.
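That hash table fits in a single awk command: each status code becomes an array key whose counter is incremented, so the 38 million lines never have to be sorted:

```shell
time zcat status.log.gz | awk '{count[$1]++} END {for (s in count) print count[s], s}'
```

The output order is arbitrary; append | sort -rn to order by count.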

Counting is also very fast:

# get number of requests with status code equal to 200
time zcat status.log.gz | grep -E "^200$" | wc -l
36689940

real    0m3.078s
user    0m2.944s
sys     0m0.128s

# get number of requests with status code not equal to 200
time zcat status.log.gz | grep -Ev "^200$" | wc -l
3078900

real    0m1.799s
user    0m1.736s
sys     0m0.060s

Summing:

# sum up all transferred bytes (5.5951e+11 ~ 521 GB)
time zcat length.log.gz | awk '{s+=$1}END{print s}'
5.5951e+11

real    0m5.708s
user    0m5.556s
sys     0m0.144s

The biggest column is normally the one containing the URLs (118 MB in our example). We can reduce its size by assigning a unique ID to each URL and saving the list of distinct URLs in a separate file.

Coming from 280 MB, we can reduce the total size to 130 MB by replacing URLs, referers and user agents with IDs and storing the distinct values in two lookup files (one shared by URLs and referers, one for user agents).

Here is the code for splitting a log file into column-based files:
$urls = [];
$agents = [];
$split_urls = true;
$split_agents = true;

$fp_ip = gzopen("ip.log.gz", "w");
$fp_user = gzopen("user.log.gz", "w");
$fp_client = gzopen("client.log.gz", "w");
$fp_date = gzopen("date.log.gz", "w");
$fp_url = gzopen("urls.log.gz", "w");
$fp_url_list = gzopen("urls_list.log.gz", "w");
$fp_status = gzopen("status.log.gz", "w");
$fp_length = gzopen("length.log.gz", "w");
$fp_referer = gzopen("referer.log.gz", "w");
$fp_agent = gzopen("agent.log.gz", "w");
$fp_agent_list = gzopen("agent_list.log.gz", "w");

$fp = gzopen("httpd.log.gz", "r");
while (!gzeof($fp)) {
  $line = gzgets($fp, 8192);
  // (?:[A-Z]+ )? strips the request method (GET, POST, HEAD, ...) from the URL
  preg_match("!^([0-9\.]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] \"(?:[A-Z]+ )?([^\"]*)".
    " HTTP/1\.[01]\" ([^ ]+) ([^ ]+) \"([^\"]*)\" \"([^\"]*)\"\$!", $line, $m);
  if (empty($m)) {
    echo "### ".$line; // report lines that did not match the pattern
    continue;
  }
  gzwrite($fp_ip, $m[1]."\n");
  gzwrite($fp_user, ($m[2]=="-" ? "" : $m[2])."\n");
  gzwrite($fp_client, ($m[3]=="-" ? "" : $m[3])."\n");
  gzwrite($fp_date, $m[4]."\n");

  if ($split_urls) {
    if (!isset($urls[$m[5]])) $urls[$m[5]] = count($urls); // 0-based ID = line number in urls_list.log.gz
    gzwrite($fp_url, $urls[$m[5]]."\n");
  } else {
    gzwrite($fp_url, $m[5]."\n");
  }
  gzwrite($fp_status, $m[6]."\n");
  gzwrite($fp_length, ($m[7]=="-" ? "0" : $m[7])."\n");
  
  if ($m[8]=="-") {
    gzwrite($fp_referer, "\n");
  } else if ($split_urls) {
    if (!isset($urls[$m[8]])) $urls[$m[8]] = count($urls);
    gzwrite($fp_referer, $urls[$m[8]]."\n");
  } else {
    gzwrite($fp_referer, $m[8]."\n");
  }
  if ($m[9]=="-") {
    gzwrite($fp_agent, "\n");
  } else if ($split_agents) {
    if (!isset($agents[$m[9]])) $agents[$m[9]] = count($agents);
    gzwrite($fp_agent, $agents[$m[9]]."\n");
  } else {
    gzwrite($fp_agent, $m[9]."\n");
  }
}
gzwrite($fp_url_list, implode("\n", array_keys($urls)));
gzwrite($fp_agent_list, implode("\n", array_keys($agents)));

gzclose($fp);
gzclose($fp_ip);
gzclose($fp_user);
gzclose($fp_client);
gzclose($fp_date);
gzclose($fp_url);
gzclose($fp_url_list);
gzclose($fp_status);
gzclose($fp_length);
gzclose($fp_referer);
gzclose($fp_agent);
gzclose($fp_agent_list);
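To turn the ID columns back into readable values, join them against the list files. A minimal sketch on the shell, assuming the IDs are 0-based line numbers into urls_list.log.gz:

```shell
# materialize the URL list once, then translate each ID back into its URL
zcat urls_list.log.gz > urls_list.txt
zcat urls.log.gz | awk 'NR==FNR {list[NR-1]=$0; next} {print list[$1]}' urls_list.txt -
```

The first file fills the awk array (NR==FNR is only true while reading it); the second pass prints the URL for each ID from stdin.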
Note: reading data from disk is always slower than analyzing the data in real time while it is still in memory.

