Archive for

April, 2011

...

Analysing Apache Logs: gnuplot and awk

no comments

The Apache http logs

I wanted to make a graph of the amount of data served by my Apache server, with a bit finer granularity than AWStats could give. The http_access file has all the information I needed, including the time of each request and the bytes served. Assuming the standard combined format, the time stamp is in the 4th field, and the bytes served in the 10th.

Thus, the following will isolate the necessary data for my graph. (Note, the log can usually be found at /var/log/httpd/access_log).

cat /tmp/access | cut -f 4,10 -d ' '

However, it turns out that not all log entries store the bytes served. This includes "file not found" responses, and certain requests which return no data. Some cases will have a hyphen, while others will simply be blank. To pick out only the lines which contained data, I extended the line above to:

cat /tmp/access | cut -f 4,10 -d ' ' | egrep ".* [0-9]+"
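
For comparison, the same extraction and filtering can be sketched in Python. This is only an illustration and not part of the pipeline above: the function name and the two sample log lines are made up, and the digit check is stricter than the egrep pattern (it requires the whole 10th field to be numeric).

```python
import re

def extract_time_and_bytes(lines):
    """Yield (timestamp, bytes) pairs from combined-format log lines."""
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 10:
            continue  # too short to be a combined-format entry
        timestamp, size = fields[3], fields[9]
        # Skip entries where the byte count is a hyphen or blank
        if re.fullmatch(r"[0-9]+", size):
            yield timestamp, int(size)

# Made-up sample entries, for illustration only
sample = [
    '127.0.0.1 - - [24/Apr/2011:10:00:01 +0200] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla"',
    '127.0.0.1 - - [24/Apr/2011:10:00:02 +0200] "GET /missing HTTP/1.1" 304 - "-" "Mozilla"',
]
print(list(extract_time_and_bytes(sample)))
```

Only the first sample line survives, since the second has a hyphen in the bytes field.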

The first plot

This is enough to start working with in gnuplot. First we have to set the time format of the x-axis. The Apache log file uses this format: "[10/Oct/2000:13:55:36", or in terms of strftime(3) format: "[%d/%b/%Y:%H:%M:%S". (Note that the opening bracket from the log is included in the format string.)

To set the time format in gnuplot, and furthermore specify that we work with time on the x-axis:
set timefmt "[%d/%b/%Y:%H:%M:%S"
set xdata time
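
As a quick sanity check of the format string outside gnuplot, here is a small Python sketch using the same strftime-style directives. It assumes the default C locale, so that %b matches English month abbreviations like "Oct"; the constant name is my own.

```python
from datetime import datetime

# Same format as the gnuplot timefmt, including the leading bracket
LOG_TIME_FORMAT = "[%d/%b/%Y:%H:%M:%S"

parsed = datetime.strptime("[10/Oct/2000:13:55:36", LOG_TIME_FORMAT)
print(parsed.isoformat())
```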

The data can then be plotted with the following command:
plot "< cat /tmp/access | cut -f 4,10 -d ' ' | egrep '.* [0-9]+'" using 1:2

To output to file, the following will do. The graph below shows the served files from my logs in the last couple of days.
set terminal png size 600,200
set output "/tmp/gnuplot_first.png"

First plot

Improvements
There are a few improvements to be made on the graph above: Most importantly, the data is slightly misleading, since files served at the same time are not accumulated. Furthermore, aesthetics like the legend, axis units, and title formatting are missing. Also note that the graph is scaled to a few outliers: I have a 7 MB video on my blog, which is downloaded occasionally. For the following examples, I will focus on the first day, where this file is not included.

First, I've made some minor improvements, and in the second graph I've applied the "frequency" smoothing function. Notice how the first graph has a maximum around 440 kb, while the smoothed and accumulated graph below peaks at around 900.
set terminal png size 600,250
set xtics rotate
set xrange [:"[24/Apr/2011:22"]

plot "< cat /tmp/access | cut -f 4,10 -d ' ' | egrep '.* [0-9]+'" using 1:($2/1000) title "kb" with points

plot "< cat /tmp/access | cut -f 4,10 -d ' ' | egrep '.* [0-9]+'" using 1:($2/1000) title "kb" smooth frequency with points

Improved

Frequency smoothing

awk
Although the frequency smoothing function gives an accurate picture, some of the accumulations are done over too wide a range, thus giving the impression of a higher load than is the case. Another way to sum up the data is to aggregate all requests in the same second into a sum. This can be done with the following awk script:

awk '{ date=$1; if (date==olddate) sum=sum+$2; else { if (olddate!="") {print olddate,sum}; olddate=date; sum=$2}} END {print date,sum}'

The input still has to be scrubbed, so the final line looks like this:
cat /tmp/access | cut -f 4,10 -d ' ' | egrep ".* [0-9]+$" | awk '{ date=$1; if (date==olddate) sum=sum+$2; else { if (olddate!="") {print olddate,sum}; olddate=date; sum=$2}} END {print date,sum}' > /tmp/access_awk
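
The logic of the awk script may be easier to follow in a Python sketch. Like the awk version, it only merges consecutive entries with the same time stamp, which is fine since the log is written in time order. The function name and the sample values are invented for illustration.

```python
from itertools import groupby

def sum_per_second(pairs):
    """Sum bytes over consecutive (timestamp, bytes) entries sharing a timestamp."""
    for stamp, group in groupby(pairs, key=lambda pair: pair[0]):
        yield stamp, sum(size for _, size in group)

# Invented sample data: two requests in the same second, then one more
pairs = [
    ("[24/Apr/2011:10:00:01", 5120),
    ("[24/Apr/2011:10:00:01", 1024),
    ("[24/Apr/2011:10:00:03", 300),
]
for stamp, total in sum_per_second(pairs):
    print(stamp, total)
```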

Plotting these two functions in the same graph shows the difference between the peaks of the frequency function, and the simple aggregation:
plot "< cat /tmp/access | cut -f 4,10 -d ' ' | egrep '.* [0-9]+'" using 1:($2/1000) title "frequency" smooth frequency with points, "/tmp/access_awk" using 1:($2/1000) title "awk" with points lt 0

awk and frequency smoothing

Moving average in Gnuplot
For the daily graph, I think I'd prefer the one using the awk output, and perhaps using lines or "impulses" as style instead. However, it does not address the outliers. To smooth them out, we could try a moving average. This is not supported by any native function in gnuplot, so we have to roll our own. Thanks to Ethan A Merritt, there is an example of this.

Of course, this will put a lot less emphasis on peaks, and the outlier at 650 kb in the graphs above is now represented by a spike of less than 200. Furthermore, there is a problem with taking the moving average of time data of inconsistent frequency: the values will be the same whether the last five requests were spread over an hour or a few seconds.

samples(x) = $0 > 4 ? 5 : ($0+1)
avg5(x) = (shift5(x), (back1+back2+back3+back4+back5)/samples($0))
shift5(x) = (back5 = back4, back4 = back3, back3 = back2, back2 = back1, back1 = x)
init(x) = (back1 = back2 = back3 = back4 = back5 = 0)

plot init(0) notitle, "/tmp/access_awk" using 1:(avg5($2/1000)) title "awk & avg5" with lines lt 1

moving avg
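
To make the gnuplot functions a bit less cryptic, here is the same idea as a Python sketch: a trailing average over the last five values, where the first few points are divided by the number of samples seen so far (as samples() does above). The function name and the input values are my own, chosen only for illustration.

```python
from collections import deque

def moving_average(values, window=5):
    """Yield the mean of the last `window` values seen so far."""
    recent = deque(maxlen=window)  # oldest value drops out automatically
    for value in values:
        recent.append(value)
        yield sum(recent) / len(recent)

print(list(moving_average([10, 20, 30, 40, 50, 60])))
```

Note that this sketch has the same weakness as described above: it counts samples, not seconds, so it ignores how far apart in time the requests were.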

Zooming out to show all the days, the moving average is perhaps more appropriate, since the data is overall at a more consistent frequency.

set xrange [*:*]
set format x "%d"

plot "/tmp/access_awk" using 1:(avg5($2/1000)) title "awk & avg5" with lines

full graph, moving avg

Cumulative
Finally, another interesting view is the cumulative output day by day. This can easily be achieved by inserting a blank line in the data file between each day. In awk, using the previous sum file generated above, it can be done like this:

cat /tmp/access_awk | awk 'BEGIN { FS = ":" } ; { date=$1; if (date==olddate) print $0; else { print ""; print $0; olddate=date}}' > /tmp/access_awk_days

Or an alternative, based on the original access_log file. The aggregation per second is not necessary, since the "cumulative" function will do the same operation, and the graph will be exactly the same:
cat /tmp/access | cut -f 4,10 -d ' ' | egrep ".* [0-9]+$" | awk 'BEGIN { FS = ":" } ; { date=$1; if (date==olddate) print $0; else { print ""; print $0; olddate=date}}' > /tmp/access_awk_days
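
The day-splitting step can also be sketched in Python. As in the awk version, the day is taken to be everything before the first colon of the time stamp, and a blank line is emitted whenever it changes (including before the very first line). The function name and sample lines are made up.

```python
def split_days(lines):
    """Yield the input lines with an empty line inserted before each new day."""
    previous_day = None
    for line in lines:
        day = line.split(":", 1)[0]  # e.g. "[23/Apr/2011"
        if day != previous_day:
            yield ""
            previous_day = day
        yield line

# Made-up sample in the "timestamp bytes" layout used above
sample_lines = [
    "[23/Apr/2011:10:00:01 100",
    "[23/Apr/2011:10:00:02 50",
    "[24/Apr/2011:00:00:01 70",
]
print("\n".join(split_days(sample_lines)))
```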

And the gnuplot. Note that the tics on the x-axis are set manually here, starting on the day before the first day in the plot, and ending on the last. The increment is set to a bit less than a day in seconds (60 * 60 * 24 = 86400) to approximately center each tic under its line. Also note that the format of the start and end arguments still has to be the same as set in the beginning, with timefmt.
set xtics "[23/Apr/2011:0:0:0", 76400, "[29/Apr/2011:23:59:59"
set format x "%d"

plot "/tmp/access_awk_days" using 1:($2/1000000) title "cumulative (MB)" smooth cumulative

cumulative
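
The per-day running total that this graph is meant to show can be sketched in Python as follows. It assumes that the total restarts from zero at each new day, and scales to MB in the same way as the plot command. The function name and the sample values are invented.

```python
from itertools import accumulate, groupby

def cumulative_per_day(pairs):
    """Yield (timestamp, running total in MB), restarting at each new day."""
    day_of = lambda pair: pair[0].split(":", 1)[0]
    for _, group in groupby(pairs, key=day_of):
        group = list(group)
        totals = accumulate(size for _, size in group)
        for (stamp, _), total in zip(group, totals):
            yield stamp, total / 1000000

# Invented sample data spanning two days
day_pairs = [
    ("[23/Apr/2011:10:00:01", 1000000),
    ("[23/Apr/2011:10:00:02", 500000),
    ("[24/Apr/2011:00:00:01", 2000000),
]
print(list(cumulative_per_day(day_pairs)))
```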

Photo Textures: Flypaper Textures

no comments

I recently came across the photo textures from Flypaper Textures, and they seem to have attracted a fan base. Used to add “depth and interest” to a picture, an overlay texture creates a more painterly effect (appearing as if painted with a brush), and sometimes an impressionist-like style. Some of the examples from Flypaper Textures are perhaps a bit on the over-saturated end, but they will definitely stand out in a Flickr stream.

Although the basic principle is simple (overlay an image with less than 100% opacity), some of the tutorials available are interesting: from Flypaper Textures itself, CoffeShop blog, and a video from Digiscrap101.

Finally, here’s a similar tutorial in GIMP, focusing on the layers features. The example there is a lot less subtle, but the same principle and functions apply.

Before After

555 Contest

no comments

Evil Mad Scientist is showing some of the entries for the 555 Contest. There are some great creations, but perhaps the most entertaining is the “Le Dominoux” (or “LED dominoes”) by Randy Elwin. Through a series of many light receivers & emitters, he creates a domino effect with lights. And when put in a loop, it “self sustains” the cascading effect. Genius!

Sparklines

no comments

Looking for a Python sparkline library, I found Perry Geo’s excellent code. “In the minimalist spirit of sparklines, the interface was kept simple”:

import spark
a = [32.5,35.2,39.9,40.8,43.9,48.2,50.5,51.9,53.1,55.9,60.7,64.4]
spark.sparkline_smooth(a).show()

That’s it, and here’s the result. Just download his single Python module, start up interactive Python, and off you go.

This of course sent me on a tangent, off to Edward Tufte’s work and creation of sparklines. It seems I have a book or two to buy.

Strahov Library 40 Gigapixels

no comments

Jeffrey Martin, founder of 360Cities, recently released a 40 GP indoor panorama of the Strahov Library in Prague. It claims to be the world’s largest indoor panorama. It consists of 2947 shots, which combine into a 280,000 x 140,000 pixel, 280 GB image.

You can view it here, but be aware that the Flash application and pictures can take quite some time to load. I also had Flash crash several times.

The TC article mentions that he used a Canon 550D and a 200mm lens. It is also covered by Wired, and from their picture of the setup, it seems to be a Canon EF 70-200mm f/2.8 L USM. The Canon 550D is an 18 MP camera, which means 2947 input images give a total of 53 GP of raw data. Furthermore, he uses RAW files, at around 20 – 25 MB each, so that would take up 59 to 74 GB on the card. (Thus, the 280 GB number above seems a bit strange.)
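
The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is just the arithmetic from the paragraph, using decimal units (1 GP = 1000 MP, 1 GB = 1000 MB); the variable names are my own.

```python
shots = 2947
megapixels_per_shot = 18  # Canon 550D sensor resolution

# Total raw pixel count, in gigapixels
raw_gigapixels = shots * megapixels_per_shot / 1000

# Card space for RAW files of 20 to 25 MB each, in GB
low_gb = shots * 20 / 1000
high_gb = shots * 25 / 1000

print(f"{raw_gigapixels:.1f} GP, {low_gb:.1f} to {high_gb:.1f} GB")
```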

Furthermore, it is interesting to note that he uses the GigaPanBot by T. Emrich from Germany. I wrote about his project in November last year, and got the impression it was more of a hobby project. It seems he has made a nice niche business for himself.

In the Wired article, they mention that the camera does not always get focus, so Jeffrey has to jump up, pause the robot, fix the focus, and continue. It also says that on the first day, he managed to finish about 20% of the job before the library closed at 5 pm. It doesn’t say how long it took to complete, but at that rate it would take a week! After that, it took 111 hours to stitch everything together, and about 10 hours of work to fix misaligned images.

Teensy USB Development Board

no comments

The Teensy is a complete USB-based microcontroller development system, in a very small footprint. It is similar to Arduino boards in that it uses an ATMEGA Atmel AVR processor, but it is not compatible. Even so, there is a software add-on for the Arduino IDE. They claim the USB communication is faster than the Arduino, since the latter uses USB-to-serial communication. As far as I understand, that is also the case with the new Arduino Uno.

The Teensy boards are much smaller than the basic Arduino boards, but still feature a mini USB port. They are also available with header pins to fit on a breadboard. PJRC lists a number of fun projects using the Teensy.

Storage Prices

1 comment

It’s only about two months since the last update on storage prices, so not much has happened. The different SSD offerings are difficult to track, though. Anandtech laments how Kingston has a confusing SSD lineup, with six different parallel models. However, the other manufacturers are not easy to follow either.

This update adds more of the smaller SSDs. I might get a 60 GB one before the 128 GB ones come down to a reasonable level.

Media Type | Product | Capacity | Price CHF | Price Euros | Euros / GB | GBs / Euro
Harddisk | Western Digital Caviar Green 2TB | 2000 GB | 92.00 | 70.04 | 0.04 | 28.55
Harddisk | Western Digital Caviar Green 1.5TB | 1500 GB | 74.00 | 56.34 | 0.04 | 26.62
External 3.5 | Western Digital Elements Desktop 2TB | 2000 GB | 99.00 | 75.37 | 0.04 | 26.54
External 3.5 | Western Digital Elements Desktop 3TB | 3000 GB | 169.00 | 128.66 | 0.04 | 23.32
Harddisk | Western Digital Caviar Green 1TB | 1000 GB | 62.00 | 47.20 | 0.05 | 21.19
Harddisk | Western Digital Caviar Green 3TB | 3000 GB | 199.00 | 151.50 | 0.05 | 19.80
Harddisk | Western Digital Caviar Green 2.5TB | 2500 GB | 174.00 | 132.47 | 0.05 | 18.87
External 3.5 | Western Digital Elements Desktop 1TB | 1000 GB | 77.00 | 58.62 | 0.06 | 17.06
Harddisk | Western Digital Caviar Green 500GB | 500 GB | 49.00 | 37.31 | 0.07 | 13.40
External 2.5 | Western Digital Elements SE 1TB | 1000 GB | 109.00 | 82.98 | 0.08 | 12.05
External 2.5 | Western Digital Elements SE 500GB | 500 GB | 70.00 | 53.29 | 0.11 | 9.38
DVD-R | Verbatim 16x DVD-R 100 @ 4,7GB | 470 GB | 91.00 | 69.28 | 0.15 | 6.78
Blu-ray | Verbatim BD-R SL 25 @ 50GB(*) | 1250 GB | 248.00 | 188.81 | 0.15 | 6.62
Blu-ray | Verbatim BD-R 25 @ 25GB | 625 GB | 136.00 | 103.54 | 0.17 | 6.04
DVD+R DL | Verbatim 8x DVD+R DL 25 @ 8,5GB | 213 GB | 54.00 | 41.11 | 0.19 | 5.17
CD-R | Verbatim CD-R 100 @ 700MB | 70 GB | 29.00 | 22.08 | 0.32 | 3.17
USB Flash | Sandisk Cruzer Flash Drive 16GB | 16 GB | 29.00 | 22.08 | 1.38 | 0.72
SSD | Kingston SSDnow V 100 Series 128GB (kit) | 128 GB | 235.00 | 178.91 | 1.40 | 0.72
USB Flash | Sandisk Cruzer Flash Drive 32GB | 32 GB | 59.00 | 44.92 | 1.40 | 0.71
SSD | Kingston SSDnow V 100 Series 256GB | 256 GB | 473.00 | 360.11 | 1.41 | 0.71
SSD | OCZ SSD Vertex 2 Extended Cap. 120GB | 120 GB | 229.00 | 174.34 | 1.45 | 0.69
SSD | Corsair Force F120 120GB | 120 GB | 239.00 | 181.96 | 1.52 | 0.66
USB Flash | Sandisk Ultra Cruzer BACKUP 64GB | 64 GB | 135.00 | 102.78 | 1.61 | 0.62
SSD | OCZ SSD Vertex 2 Extended Cap. 60GB | 60 GB | 129.00 | 98.21 | 1.64 | 0.61
SSD | Intel 320 Series 80GB | 80 GB | 178.00 | 135.52 | 1.69 | 0.59
SSD | Corsair Force F60 60GB | 60 GB | 139.00 | 105.82 | 1.76 | 0.57
SSD | Kingston SSDnow V+100 Series 64GB | 64 GB | 154.00 | 117.24 | 1.83 | 0.55
SSD | Intel 510 Series 120GB | 120 GB | 299.00 | 227.64 | 1.90 | 0.53
SSD | Corsair Force F180 180GB | 180 GB | 468.00 | 356.30 | 1.98 | 0.51
USB Flash | Kingston DataTraveler 310 256GB | 256 GB | 726.00 | 552.72 | 2.16 | 0.46
USB Flash | Sandisk Cruzer Flash Drive 8GB | 8 GB | 23.00 | 17.51 | 2.19 | 0.46
SSD | Kingston SSDnow V Series 30GB | 30 GB | 89.00 | 67.76 | 2.26 | 0.44
SSD | Corsair Force F40 40GB | 40 GB | 126.00 | 95.93 | 2.40 | 0.42
SSD | Corsair P256 SSD MLC, 256GB | 256 GB | 844.00 | 642.56 | 2.51 | 0.40
Compact Flash | Sandisk CF Card 32GB Extreme | 32 GB | 215.00 | 163.69 | 5.12 | 0.20
Compact Flash | Sandisk CF Card 16GB Extreme | 16 GB | 119.00 | 90.60 | 5.66 | 0.18
Compact Flash | Sandisk CF Card 64GB Extreme Pro(*) | 64 GB | 597.00 | 454.51 | 7.10 | 0.14

*) These offerings are no longer available from Digitec. To be removed from the list in the next round.

Exchange rate: 1 Euro = 1.313495 CHF.
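
For reference, the derived columns in the table can be recomputed from the CHF price, the capacity and the exchange rate above. The following Python sketch shows the calculation; the function name is my own, and rounding to two decimals is assumed from how the table is printed.

```python
CHF_PER_EURO = 1.313495  # exchange rate quoted above

def price_metrics(capacity_gb, price_chf):
    """Return (price in Euros, Euros per GB, GB per Euro), rounded to 2 decimals."""
    price_eur = price_chf / CHF_PER_EURO
    return (
        round(price_eur, 2),
        round(price_eur / capacity_gb, 2),
        round(capacity_gb / price_eur, 2),
    )

# First row of the table: Western Digital Caviar Green 2TB at 92.00 CHF
print(price_metrics(2000, 92.00))
```

This gives 70.04 Euros, 0.04 Euros / GB and 28.55 GB / Euro, matching the first row.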

Choosing an SSD

1 comment

With most technology, I choose to be a late adopter. Letting other people do the first rounds of QA has saved me loads of money. Waiting for the prices of new gadgets to drop down to reasonable levels has saved even more. So after “everybody” has gotten an SSD, I’m thinking it’s time to look into it.

I expect to use the drive as a boot drive for Fedora, so it should excel on the random read/write tests. I have an older motherboard which only supports SATA 3.0 Gb/s, so the high end SSDs are not interesting at this point. Finally, I’m running a fan-less, water cooled system with a Silverstone Nightjar fanless PSU, and have thus also opted for the quiet WD Green drives (5400 – 7200 RPM). It means switching to SSD will be a very significant improvement, while also removing the last noise from the back-scatter of disk bound OS work.

In Anandtech’s review from November 2010, the Corsair Force drives are on top. Furthermore, he stresses the SandForce controllers as “the sensible choice” for OS and applications. At 180 Euros, the F120 is a bit pricey, while the F40 and F60 are almost the same at 98 and 105 Euros respectively. Although the F60 was not included in Anandtech’s review, it seems like a safe bet. 60 GB should also be plenty of space for the OS, swap, and basic user files (documents, e-mail, but not images or video).

As for compatibility, the Fedora 14 documentation mentions that “ext4 is the only fully-supported file system that supports TRIM”. Furthermore, to enable the TRIM command (which is disabled by default), the drive should be mounted with the discard option. Finally, the docs state that the swap partition will use TRIM by default. In other words, everything is ready to go.

Robert Penz goes into detail to bust some of the myths around SSDs. He concludes that on a normal user system, you don’t need to take special precautions when switching from spinning to solid state drives. Only his advice to use “noatime” seems incorrect, and is challenged by this thread: “noatime is not necessary. Fedora defaults to relatime, which is a better choice: it reduces disk access almost as much as noatime, but preserves enough atime info for practical purposes”.

Panorama

2 comments

Here is the first in a series of panorama pictures I’ve worked on over the past years. This is of Zurich, taken from Üetliberg. It is composed of 211 single 15 MP pictures, and the result is an image of 33585×6832 pixels or 230 MP.

It was stitched using the open source panorama tool Hugin, and split into tiles using ImageMagick. The panorama is rendered using a home made HTML 5 Canvas viewer. If you’re interested, the source is available under GPL 3.

The ImageMagick commands are worth a closer look. As mentioned, the complete image is 230 MP, and to serve scaled tiles, it is useful to work with something smaller. Five different scales were created from the original. Here is the basic resize command to 50%:

convert input.tif -resize 50% output_50.tif

Next, each of the resized images was tiled following the excellent instructions on the IM site. They were cropped to equal tiles, so there is only a one pixel difference between some of the tiles. For the current panorama, I’ve chosen to ignore that difference, and render based on the smallest.

The following gives 49 columns and 8 rows, with the first, top left hand tile getting the filename tile_0.jpg. It is worth noting that not all tile counts worked; in some cases only the first row would be produced. Changing (often increasing) the tile count would work around that.

convert output_50.tif -verbose -crop 49x8@ +repage +adjoin tile_%d.jpg
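
To see where the one pixel difference between tiles comes from, here is a small Python sketch that divides a length into a given number of near-equal tiles. It uses the full-size dimensions from above (33585×6832) as an example. The function name is made up, and which tiles get the extra pixel is an assumption of this sketch; ImageMagick may distribute the remainder differently.

```python
def tile_sizes(length, count):
    """Split `length` pixels into `count` tiles that differ by at most one pixel."""
    base, extra = divmod(length, count)
    # Assumption: the first `extra` tiles get the one additional pixel
    return [base + 1] * extra + [base] * (count - extra)

widths = tile_sizes(33585, 49)   # 49 columns
heights = tile_sizes(6832, 8)    # 8 rows

print(sorted(set(widths)), sorted(set(heights)))
```

Here the width does not divide evenly, so some columns are 686 pixels wide and the rest 685, while all rows are 854 pixels high.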

Finally, I wanted to put a watermark on some of the images. Here, I also followed the IM instructions without problems. To create the “stamp”, the following did it:

convert -size 300x50 xc:grey30 -font FreeSans-Medium -pointsize 20 -gravity center -draw "fill grey70  text 0,0  'Copyright'" stamp_fgnd.png
convert -size 300x50 xc:black -font FreeSans-Medium -pointsize 20 -gravity center -draw "fill white  text  1,1  'hblok.net' text  0,0  'Copyright' fill black  text -1,-1 'Copyright'" +matte stamp_mask.png
composite -compose CopyOpacity  stamp_mask.png  stamp_fgnd.png  stamp.png
mogrify -trim +repage stamp.png

To apply the stamp to an image, e.g. tile_20.jpg

composite -gravity SouthEast -geometry +10+10 stamp.png tile_20.jpg watermarked.jpg
