VirtualBox


One of the very few things that I still need Windows for is Java ME development. The NetBeans Mobility Pack runs under Linux but unfortunately all the manufacturers’ emulators require Windows. So now that I have sold my Windows laptop I needed some other way of running Windows. I considered using Boot Camp on my MacBook, or even repartitioning my Linux box to dual-boot Windows, but I am now using some VM software called VirtualBox.

I actually have a copy of VMware 4.5 but that doesn’t play nicely with the latest Fedora versions, so rather than pay to upgrade VMware I am using VirtualBox, which is free for personal use. So far it runs really nicely on my dual-core Linux machine.

A few things I needed to get it going under FC6:

  • Install the version for all distributions
  • Install compat-libstdc++-33
  • Make sure you put any users into the vboxusers group
  • Mess around with SELinux settings as described here

MySQL Users Conference – Day 4


Well, it is the last day of the MySQL Users Conference. It’s been really interesting with lots of great sessions and keynotes. Today was a bit shorter and I think the programme was the weakest as far as sessions went but here are the ones I attended.

Rasmus gave a keynote today with the rather strange title “PHP on Hormones”. He showed us an example of “PHP v1.0” and gave four reasons why people contribute to open source:

  1. Self interest
  2. Self expression
  3. Hormones (oxytocin, nature’s “trust hormone”, is released through interaction with others)
  4. Improve the world

He made the good point that Web 2.0 is all about contributions from users: successful sites harness network effects and get better the more people use them, in a way that caters to each user’s own interest.

It’s not what people think about you, but rather what they think about themselves.

This is a fundamental law of human nature that is worth remembering. People like to feel important and, in general, are self-interested.

He went over a lot of the same stuff as in his other session on security (including defacing the conference website through its XSS vulnerability 🙂).

Another interesting keynote was Scaling MySQL at YouTube by Paul Tuckfield. This was packed with some fairly technical stuff and he didn’t quite have time to go through it all in enough detail but here are some key points:

The top reasons for YouTube scalability are:

  • python
  • memcache
  • replication

They originally had a separate database running MyISAM to do click tracking and ran everything else on an InnoDB database with replication. YouTube have far more reads than writes so replication works well for them but it is a constant battle to keep replication lag down.

One way they worked around replication lag was to have separate replication pools – certain pages are sent to certain replicas. This lowered lag and increased cache affinity.
The other way to speed up replication is with hardware. Their standard DB boxes are:

  • 4x2GHz cores
  • 16GB RAM
  • 12x10k RAID10

Paul is a big believer in the power of disks. He suggests rather than upgrading CPUs people should increase RAM and spindle count.

He also had some good suggestions for optimising drives and caching. Obviously if you can fit your DB in RAM that is great, but otherwise writes should only be cached by the RAID controller and reads should only be cached by the database. The reasoning: the database has the largest cache, so a hit in the database cache means the lower caches went unused, while a miss in the database cache can only miss in the other caches too because they are smaller and not as smart.

Implementing High Availability with DRBD was an interesting discussion of the Distributed Replicated Block Device (DRBD). DRBD can give you more availability than standard replication but less than MySQL Cluster. (Five nines of uptime means your site can be down for only about five minutes in a year.)

DRBD is kind of like network RAID1. While standard MySQL replication is asynchronous, DRBD does synchronous block-level replication. It is set up with heartbeat in an active/passive failover pair. You can still use normal replication on top of this by pointing the slaves at the floating IP.
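
From what I understood, a minimal drbd.conf resource looks something like this (hostnames, devices and addresses are entirely made up); MySQL’s datadir then lives on the filesystem you create on the DRBD device, and heartbeat decides which node mounts it and owns the floating IP:

resource r0 {
  protocol C;                  # synchronous: a write completes only when
                               # it has hit both nodes' disks
  on db1 {
    device    /dev/drbd0;
    disk      /dev/sda5;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on db2 {
    device    /dev/drbd0;
    disk      /dev/sda5;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}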

John Allspaw gave a session on capacity planning at flickr. They use Ganglia for monitoring, which looks pretty cool, along with various tools for deployments.

In the High Performance Data Warehousing with MySQL session, Brian Miezejewski made the point that partitioning is key, but the scheme should be flexible enough to grow without table rebuilds.

He suggested MyISAM should be used for most DW needs because the data files are smaller, they are easy to copy between servers, and for most DW queries they are faster than InnoDB. However, there are problems: table locking hurts when loading high volumes of real-time data, deletes break concurrent inserts, there is no index clustering, and repairs on big tables take a LONG time.

InnoDB is good for high-volume streaming inserts (although if you only ever insert, you could use MyISAM). It is also good if all queries can take advantage of index clustering (InnoDB clusters the data on its primary key, and lookups by PK are faster than keyed lookups in MyISAM), but if you don’t use the primary key, performance can go down.

Primary keys in InnoDB should be kept small, as the PK is stored with every secondary index as well.

Brian also talked about the new Nitro storage engine. This proprietary engine seems to do some pretty amazing stuff and I was talking to Terry Christensen from Nitro Security about it the other day.

It can do very high insertion rates (20k inserts/sec) with simultaneous queries. Also, it has ultra-high performance on aggregate operations (sum, avg, etc) on index values. It uses some ingenious indexing for this. It runs well on 32bit machines and can use partial indexes in ways other engines cannot.

The only downsides seem to be that it doesn’t do transactions and it costs $15,000 per server (this includes a full MySQL Enterprise Edition). Still, the stats for this engine are very impressive.

For backing up very large tables, Brian suggests copying the data files; if you partition, you only need to back up the partition that has changed. He also described a trick for building aggregates on the fly using INSERT … ON DUPLICATE KEY UPDATE. E.g. given a sales aggregate table with PRIMARY KEY (STORE, DATE):

 -- new rows start with COUNT = 1; existing rows accumulate
 INSERT INTO SALES (STORE, DATE, TOTAL, COUNT)
 VALUES (@store, @date, @amount, 1)
 ON DUPLICATE KEY UPDATE
   TOTAL = TOTAL + @amount,
   COUNT = COUNT + 1;

The closing keynote was by Pasha Sadri from Yahoo! talking about Yahoo! Pipes. This is really cool. It lets you stitch together feeds and web services to make your own customised feeds.

The conference has been tiring but worth it. There are a lot of really smart people here and it’s been good to see how web 2.0 companies like Digg and Flickr have dealt with scaling.

MySQL Users Conference – Day 3


It was the third day of the MySQL Users Conference and it was just as packed with good stuff as the days before.

Eben Moglen, Professor of Law and Legal History at Columbia University Law School and General Counsel of the Free Software Foundation, gave a very thoughtful and interesting keynote entitled “Freedom Businesses Protect Privacy”. He talked about how the type of information stored has changed over the centuries and how the most powerful holders of information in the 21st century are private data-mining organisations (using data voluntarily given) – not governments. He also made the interesting point that if Alice uses, say, Gmail and Bob sends her an email, then he too is being data mined even though it was Alice who shared her data.

His suggestion was to protect our private information by “storing it yourself” and then using freely available tools to mine your own data. I’m not sure I agree. Firstly, I don’t really want to store and analyse all my purchases at Amazon to determine new titles I might be interested in. That just isn’t going to happen. And secondly, don’t I really only accrue the benefit of data mining when thousands of other users’ data can be aggregated to determine trends? I’m not sure how useful a single person’s data is, especially to that person. Don’t get me wrong, I am all for protecting my private data but when you collect a sufficiently large sample of suitably anonymised data you can create services of real benefit to those who share their data. That said, I think the statistic of 1.6 CCTV cameras for every UK citizen is fucking obscene.

There were plenty of good sessions today too. This conference is really packed with smart people giving interesting talks.

Federation at flickr by Dathan Pattishall was a packed session. Flickr is different from a lot of the web apps being discussed here because it is quite write-intensive. To deal with this they required a multi-master setup – something that is a little tricky with MySQL. They also needed no single point of failure and had to be able to serve pages FAST.

“Federation” is their name for how the storage is organised – basically it is another sharding setup. The main components of the setup are:

  • The shards (slices of the main database) – 2 servers in master-master replication
  • Global Ring or Lookup Ring, which lets the app work out which shard holds a given user’s data. It is kind of like DNS.
  • PHP logic to connect to the shards (10 lines of PHP code!)

The shards are set up in active master-master ring replication. Because of this they needed to externalise the auto-increment IDs so they are globally unique. Two servers using circular replication handle this: the “ticket” servers, each holding a single row containing the current maximum ID. They use the new 5.0 auto-increment offset support.
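
I didn’t note down their exact SQL, but my understanding is that the scheme works something like this (the table and values are my own sketch):

-- on ticket server A; server B uses auto_increment_offset = 2, so the
-- two servers can never hand out the same ID
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 1;

CREATE TABLE tickets (
  id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  stub CHAR(1) NOT NULL UNIQUE
) ENGINE=MyISAM;

-- REPLACE deletes and re-inserts the row, generating a fresh ID each time
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();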

Users are assigned to a shard at random. This stays quite balanced because they have so many new users. Occasionally they need to migrate users to rebalance data, but this is quite easy. A user is always sent to the same machine for the whole of their session so they don’t see replication lag, i.e. their own changes apparently missing.

Each server in a shard runs at 50% load, so they can shut down half the servers in a shard for maintenance.

When searching they need to combine results from all shards, so if one shard is slow the whole query is slow. To avoid this they use two search back ends: most searches are handled by Yahoo!’s proprietary search, but some real-time searches go to the shards.

The servers are 64-bit RHEL4 boxes with 16GB RAM and 15K RPM disks in RAID10, and their data size is about 12TB of metadata (not including the image data, which is stored on a filer). They use InnoDB’s file-per-table mode so space can be reclaimed from deletes; it also makes backups and data migration easier.

Database configuration:

  • Max connections: 400 per server in shard (800 across both servers in a shard)
  • Thread concurrency 32
  • innodb_buffer_pool_size of 13GB – most of the 16GB RAM, though only a tiny fraction of the 12TB of data

When doing a schema rollout they shut down half the servers in a shard and force traffic to the other server, run the ALTER TABLE, bring the server back up, then repeat for the other server.

Both of the Amazon AWS sessions were scheduled at the same time. Luckily I went to the S3 storage engine first which ran short so I could sneak into the other session too.

Mark Atwood has created a new storage engine that uses Amazon’s S3 service. Unfortunately it is not yet a replacement for a general SQL engine but it is good for storing and serving big BLOBs like images. Simple selects showed very good response times. Here is how S3 translates to MySQL:

  • AWS Account -> CREATE SERVER
  • S3 Bucket -> Table
  • S3 Item -> Row
  • S3 Item Key -> VARCHAR PRIMARY KEY
  • S3 Item Contents -> BLOB or VARCHAR

Here is how you would create a table:

CREATE TABLE `bierce` (
`word` VARCHAR(255) NOT NULL PRIMARY KEY,
`defn` BLOB)
CHARSET=utf8
ENGINE=AWSS3
CONNECTION='awss3 DevilDictionary $server MyAWSAcct';

There are some problems like:

  • Not transactional
  • No temporal guarantees – i.e. it could take some time for someone connected to a different AWS datacenter to see your latest insert
  • Bad queries can cost real money. Full table scans are bad.

This engine can only be used with 5.1.

When Mark’s talk finished I went over to the session by Gururaja Rao from Wipro on running MySQL Cluster on Amazon EC2. I missed the first bit of this session, but after speaking with him afterwards it looks like they have been very successful running Cluster on EC2 (although only 5.0, not 5.1’s on-disk NDB, so far). His plan is to build a service that monitors the state of the cluster and automatically provisions a new EC2 instance if one should fail. His architecture also involves running backups to S3 for disaster recovery. I asked him if there was a limit to the size of the cluster you can run and he didn’t seem to think there would be a problem running a 100-node cluster. Not sure how easy that would be to manage though…

This sounds really cool and would certainly be great for high availability. What would be even cooler is if it could be combined with the ability to automatically add nodes to the cluster depending on server load. That would be harder, though, as I don’t think you can add another node without restarting the cluster. I think that is planned for 5.2.

There is also the small chance that multiple members of your cluster end up on the same physical box. If you were to lose that machine it could be bad for high availability…

This definitely looks promising, although I don’t think it is something we would look at immediately as it adds quite a bit of complexity and I don’t think we would need that level of availability to begin with. Also, there is the small matter that this is just an idea and is not built yet. 🙂

I was chatting to fellow Aussie Adam Donnison at lunch yesterday, and today I sat in on his session on running the MySQL websites. It was interesting to note the similarities between the MySQL setup and our M3 infrastructure, although he uses ldirectord/LVS for software load balancing (they used to use round-robin DNS!).

It was all fairly obvious stuff for a competent sysadmin: separation of server functions, redundancy, minimising network traffic, handling failover with heartbeat. He did make a good point about redundancy, though – you should only pay for it where it is needed. If you have designed your architecture to survive the loss of a machine, you don’t really need redundant power supplies or disks on that box. You may as well save the money.

In the future they plan to introduce a globally distributed setup; at the moment the servers are all in one location. Listening to MySQL say they used to use round-robin DNS for load balancing and are only now planning a multi-site setup, and hearing the amount of data other sites like Digg (not flickr!) manage, made me realise how sophisticated our M3 setup was (is). We had a global load-balancing setup and managed a 1TB database driving 20 web servers.

Tobias Eggendorfer’s talk about German site Wundermedia was a bit weird. It was mostly about how they managed to scale their site with zero budget by writing their own caching engine and using some crazy jury-rigged replication scheme using SCP. Or maybe I dreamed that.

Andi Gutmans, who bears a striking resemblance to Mr Bean, gave a session on the state of PHP. Apparently 75% of new PHP apps are using PHP 5. Another, more surprising, 75% statistic was the percentage of PHP developers who develop on Windows. He spent some time discussing the Zend Framework and I was interested to learn about its built-in support for web services – not just standard stuff like REST and SOAP but built-in support for Amazon, flickr, Yahoo!, etc. It also has full PDF read/write support. I would be interested to see how its API compares to the R&OS PDF class, which is a bit nasty.

The reason they are moving towards an Eclipse development platform is to gain tighter integration with Java. They support the Eclipse PDT project, and v1.0 of it should be released in September.

Another really interesting PHP session was given by Mike Ho on creating web services in four lines using Qcodo. This is a really cool looking framework that has stuff like support for writing AJAX in PHP, ORM using code generation and some awesome WSDL generation abilities.

In his presentation he really did create web services that generated nasty big WSDL documents in a few lines of code. He created a simple web service class and then proceeded to access it from C# code running in Visual Studio in a Parallels desktop (an aside: there are a LOT of Macs here – at least half of the presentations were given on Macs). Here is an example:

require 'includes/prepend.inc.php';
class MyService extends QSoapService
{
    /**
     * @return string
     */
    public function foo() {
        return 'foo';
    }
}
MyService::run();

This code will generate the necessary WSDL to provide a SOAP service.

Ed Presz from Ticketmaster talked about how they use replication to scale. Some nice figures were:

  • 14,000 tickets sold in a minute
  • 500,000 tickets sold in a day

Again, replication works well for them because they have a high proportion of reads. They currently use MySQL 4.0 (they used to use MS SQL Server) but hope to move to 5.0 this year.

They use a large number of “chained” slaves, where a slave in turn acts as master to another slave. This gives them a lot of redundancy.
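
The bit of configuration that makes chaining work, as far as I know, is log-slave-updates on the intermediate slaves (the server-id here is made up):

[mysqld]
server-id         = 12
log-bin           = mysql-bin
# write replicated statements to this server's own binary log so the
# next slave in the chain can read them
log-slave-updates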

Interestingly, they replicate across the Atlantic: their servers in the UK log the transactions, which are then replicated across the pond for reporting over a 10Mbit WAN link. They standardise all their times on UTC.

Lots of people have been saying how setting innodb_flush_log_at_trx_commit=0 gives a huge improvement in performance – at worst you can lose about one second of transactions. Usually that is worth it for the gain in speed.
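
It is a one-liner to try, since the variable is dynamic:

-- flush the InnoDB log to disk roughly once a second instead of at
-- every commit; a crash can lose up to about a second of transactions
SET GLOBAL innodb_flush_log_at_trx_commit = 0;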

MySQL Users Conference 2007 – Day 2


Today was the first day of the main conference and at breakfast I met Daniel Bernholc from TradeDoubler’s Stockholm office. It was interesting to hear about the TradeDoubler architecture. They do 25,000 RPS using only about 8 Sun Opteron web servers. All their application code is written in Java and they use a custom-written Apache module for delivery.

Guy Kawasaki gave a very entertaining keynote called “The Art of Innovation” and Brian Stevens from Red Hat talked about the OLPC project.

The first session I went to was Rasmus talking about PHP performance and security. There is a new MySQL driver for PHP called mysqlnd (MySQL Native Driver) which should, in theory, give better performance, but it seems in real applications this is not the case.

Rasmus used a single Drupal page view as his example for performance tuning. Like Domas Mituzas, he recommends xdebug with KCachegrind for profiling. He also demonstrated using callgrind, running under valgrind, to provide input to the KCachegrind visualiser. He started Apache under valgrind:

# run Apache in single-process mode (-X) under callgrind
valgrind --tool=callgrind --dump-instr=yes -v /usr/sbin/httpd -X
# make the profile readable so KCachegrind can open it
chmod a+rw callgrind.out.XXXX

A good tip when analysing KCachegrind output: lots of boxes is generally a good thing. It means your processing is distributed and no one part of your code is taking all the CPU time.

Also like Domas, Rasmus recommends the APC opcode cache.

The second part of Rasmus’s session was on PHP security. I found this very interesting as he went over the different types of XSS attacks and how to avoid them. You can check whether a page is vulnerable to XSS by seeing if you can make it run alert(); if you can run alert() you can run any JavaScript in the user’s browser. Basically, to avoid XSS problems you need to make sure you never display user data back to the user unfiltered. To help with this there is the filter PHP extension, which comes as standard in PHP 5.2 and can be used with 5.0/5.1 via PECL. XSS vulnerabilities are very common on the web and he showed us several examples of vulnerable pages, including the IBM developerWorks article on how to prevent XSS attacks!

Rasmus then moved on to Cross-Site Request Forgery (XSRF). Basically, this is when data is submitted to a page from an unauthorised source. The way around this kind of attack is to use what Rasmus calls “crumbs”: bits of user-specific data included in the submission form.

The slides from his talk are here.

The session by two guys from digg.com about their architecture was interesting. They scale by having many web servers behind a load balancer talking to many MySQL slave instances; each web server opens connections to a random DB server. A memcached server farm is used to cache chunks of content, not whole pages. A lot of people seem to be talking about memcached at this conference and it is something I had heard of but never really looked at. It looks pretty neat, but most of the people talking about it here have far more reads than writes to their database, so I’m not sure how useful it would be for us.

Sharding is another technique that a lot of people seem to be implementing. It is basically partitioning your database over a number of machines, breaking a big DB into smaller ones. But unlike, say, MySQL Cluster, you have to manage the shards yourself in your application code. This can give performance improvements, but it comes at the expense of more complicated code and you can’t do SQL joins across shards. Those are some pretty big downsides, so it won’t suit everyone.

Again, they recommend using the APC opcode cache. I was surprised to hear that Digg only have about 30GB of data.

The SQL Antipatterns talk by Bill Karwin from Zend was also good. It was a quick run through common mistakes people make with their SQL:

Comma-separated lists
This means storing a list of values in a varchar column, which has a number of problems:

  • You can’t enforce referential integrity
  • It is hard to add a value to a list
  • It is difficult to search
  • You can’t even be sure the values in the list are integers

If you have this problem you probably have a many-to-many relationship, which can be solved using an intersection table.
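
A quick sketch of the fix, with made-up table names – instead of a products.tags column holding a string like ‘1,4,7’:

CREATE TABLE product_tags (
  product_id INT NOT NULL,
  tag_id     INT NOT NULL,
  PRIMARY KEY (product_id, tag_id),
  FOREIGN KEY (product_id) REFERENCES products (id),
  FOREIGN KEY (tag_id)     REFERENCES tags (id)
) ENGINE=InnoDB;

-- adding a value and searching become single, indexable operations
INSERT INTO product_tags (product_id, tag_id) VALUES (42, 7);
SELECT product_id FROM product_tags WHERE tag_id = 7;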

Ambiguous GROUP BY
This is the old problem of MySQL not enforcing the SQL standard for GROUP BY: it lets you use non-aggregated columns or calculations in the SELECT list that do not appear in the GROUP BY clause. This can cause some unexpected results, as the manual says:

Do not use this feature if the columns you omit from the GROUP BY part are not constant in the group. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.

You can solve this problem by either following the SQL standard or using the GROUP_CONCAT function.
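
A made-up example of the problem and both fixes:

-- indeterminate: customer_name is neither aggregated nor in the GROUP BY,
-- so MySQL returns an arbitrary value from each group
SELECT customer_id, customer_name, MAX(order_date)
FROM orders GROUP BY customer_id;

-- fix 1: follow the standard and group by everything you select
SELECT customer_id, customer_name, MAX(order_date)
FROM orders GROUP BY customer_id, customer_name;

-- fix 2: collapse the extra column explicitly
SELECT customer_id, GROUP_CONCAT(DISTINCT customer_name), MAX(order_date)
FROM orders GROUP BY customer_id;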

Entity-Attribute-Value Tables
This is a common way to store things like user attributes or preferences but there are some problems with this approach:

  • There is no way to declare an attribute to be NOT NULL
  • You cannot enforce referential integrity
  • It is expensive to reconstruct the attributes in a single row

These problems can be solved by:

  • Using it only as a last resort
  • Enforcing constraints on attribute values in the application logic
  • Not reconstructing all the attributes into a single row, which causes big joins

Other tips included not letting users run their own ad-hoc SQL queries on the DB(!) and avoiding SQL injection by using prepared statements, where values are bound after the SQL has been parsed.
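
A server-side example with made-up table and column names – because the value is bound after parsing, it can never be interpreted as SQL:

PREPARE find_user FROM 'SELECT id FROM users WHERE username = ?';
SET @name = 'alice'' OR ''1''=''1';  -- a classic injection payload...
EXECUTE find_user USING @name;       -- ...treated here as a plain string
DEALLOCATE PREPARE find_user;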

The performance tuning sessions are always popular and Peter Zaitsev is a smart guy; his session on server performance tuning was standing room only. He explained that you can get most of the benefit by tuning just a few parameters, like innodb_buffer_pool_size.
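
For a dedicated InnoDB box with 16GB of RAM that might look something like this in my.cnf (the figure is illustrative only, not his recommendation; the buffer pool size cannot be changed at runtime):

[mysqld]
# the single most effective knob: let InnoDB cache as much data and
# index as the machine can spare
innodb_buffer_pool_size = 12G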

Another interesting session was given by Farhan Mashraqi from Fotolog. According to Alexa this is one of the top 25 sites on the web with millions of members. It is different from something like flickr because users may only upload a single image per day (Gold members get 6).

Their site runs on about 30 Solaris servers divided into four clusters (there’s that sharding again). Each shard has a number of tables that were originally partitioned by username, but this gave uneven distribution (which seems obvious if you think about it). They use Akamai, and I was interested to hear that they rewrote their entire site from PHP into Java. The reason the CTO gave was that the developers hated PHP…

Originally their site ran on MyISAM tables, which predictably caused concurrency problems as their traffic grew, but I was interested to hear that InnoDB also takes a table-level lock when inserting into an auto-increment primary key.

Farhan then went into some detail about how they lowered their I/O by adding to and reordering the columns referenced in their primary key. I want to go over his presentation again when it is posted on the conference site, as he seemed to gain a huge performance increase from this, but I suspect it is quite specific to their data.

MySQL Users Conference 2007 – Tutorial Day


Today was tutorial day at the MySQL Users Conference. I got there early to register and get some breakfast before the first session.

The tutorial day is different from the rest of the conference in that there are only two sessions, each of three hours. The length of these sessions means the speakers can go into more detail than in a normal 45 minute session.

In the morning I attended a session given by Brian Aker called MySQL 5.1 in Depth. Brian is the Director of Architecture at MySQL and a very good speaker. He also has some amazing hair, which you can only begin to appreciate from his photo.

The next version of MySQL promises some exciting features. First up: partitioning. This has the potential to improve database performance because only the partitions a query actually needs are accessed. For example, you could partition a table by date range:

CREATE TABLE RANGE_BY_DATE
(
    customer_number INT NOT NULL,
    customer_order VARCHAR(50) NOT NULL,
    customer_order_date DATETIME NOT NULL)
PARTITION BY RANGE(YEAR(CUSTOMER_ORDER_DATE))
(
PARTITION P1 VALUES LESS THAN (2000),
PARTITION P2 VALUES LESS THAN (2003),
PARTITION P3 VALUES LESS THAN (2005),
PARTITION P4 VALUES LESS THAN MAXVALUE
)
;

Here we have partitioned the table into four partitions. Now a query selecting data from 2001 only ever needs to touch a single partition.
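
You can check the pruning with 5.1’s new EXPLAIN PARTITIONS; for the table above this should report that only partition P2 is used:

EXPLAIN PARTITIONS
SELECT * FROM RANGE_BY_DATE
WHERE customer_order_date BETWEEN '2001-01-01' AND '2001-12-31';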

There are other ways to partition a table as well, for example, ‘hash’ partitioning is good for spreading out data across a number of spindles:

CREATE TABLE HASH_EXAMPLE (col1 INT, col2 CHAR(5), col3 DATE)
PARTITION BY HASH(col1)
(
PARTITION P1 DATA DIRECTORY = '/var/lib/mysql/data1',
PARTITION P2 DATA DIRECTORY = '/var/lib/mysql/data2',
PARTITION P3 DATA DIRECTORY = '/var/lib/mysql/data3',
PARTITION P4 DATA DIRECTORY = '/var/lib/mysql/data4'
)
;

If we put each of these data files on separate disks we can gain a reduction in I/O contention.

You can even have sub or composite partitions:

CREATE TABLE sub_example
(
    customer_number INT NOT NULL,
    customer_order VARCHAR(50) NOT NULL,
    customer_order_date DATETIME NOT NULL,
    customer_service_region INT NOT NULL)
PARTITION BY RANGE(YEAR(customer_order_date))
SUBPARTITION BY HASH(customer_service_region)
(
PARTITION P1 VALUES LESS THAN (2000) (SUBPARTITION S0, SUBPARTITION S1),
PARTITION P2 VALUES LESS THAN (2003) (SUBPARTITION S2, SUBPARTITION S3),
PARTITION P3 VALUES LESS THAN (2005) (SUBPARTITION S4, SUBPARTITION S5),
PARTITION P4 VALUES LESS THAN MAXVALUE (SUBPARTITION S6, SUBPARTITION S7)
)
;

There are a few things to keep in mind:

  • Any storage engine may be partitioned
  • There is a limit of 1024 partitions (including any subpartitions)
  • If a value doesn’t match a ruleset it will be placed in the final partition
  • Hash partitioning can give you high insertion rates if you put each partition on a separate spindle. However, it doesn’t make much sense for InnoDB tables as InnoDB only uses one big table space anyway.
  • You don’t necessarily want to create hundreds of partitions in an attempt at future-proofing your database design as this will be a problem when using engines like MyISAM that will then have to open up hundreds of file descriptors.

Apparently ALTER TABLE will cause a physical rebuild of the table, although Brian mentioned that some engines are getting smarter about this and may not require a rebuild for every ALTER TABLE query.

MySQL 5.1 also finally gets disk-based clustering, so you no longer need twice your database’s size in RAM to use MySQL Cluster (NDB). Indexes still need to be kept in memory, however.

You create the tablespace using SQL rather than my.cnf configuration:

-- the logfile group referenced below has to exist first
CREATE LOGFILE GROUP lg1
ADD UNDOFILE 'undofile.dat'
INITIAL_SIZE 16M
ENGINE NDB;

CREATE TABLESPACE ts1
ADD DATAFILE 'datafile.dat'
USE LOGFILE GROUP lg1
INITIAL_SIZE 12M
ENGINE NDB;

CREATE TABLE t1
(pk1 INT NOT NULL PRIMARY KEY,
b INT NOT NULL,
c INT NOT NULL
)
TABLESPACE ts1 STORAGE DISK
ENGINE NDB;

A cool feature, especially for admins, is the event scheduler. This is basically cron inside the database. There is a new object called an ‘event’ that can be used to create one-time or recurring tasks, with bodies written just like stored procedures, e.g.:

DELIMITER //
CREATE EVENT optimise_tables
ON SCHEDULE EVERY 1 WEEK
STARTS '2006-03-05 1:00:00'
ON COMPLETION PRESERVE
DO 
BEGIN
OPTIMIZE TABLE test.table1;
OPTIMIZE TABLE test.table2;
END
//
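
One gotcha: as far as I know the scheduler is off by default in 5.1, so nothing fires until you switch it on:

SET GLOBAL event_scheduler = ON;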

I like the sound of these. They let you move regular maintenance tasks into the database rather than relying on external scripts. Also, since the bodies are stored-procedure code, you can make them as smart as you want.

MySQL 5.0 already had mysqlimport, but in 5.1 it has a new option, --use-threads, that lets you load several files in parallel. mysqlimport is basically a wrapper around LOAD DATA INFILE. Apparently it is even advantageous to use two threads on a single-core machine.
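
So loading two tables in parallel would look something like this (database and file names made up; mysqlimport derives the table name from each file name):

mysqlimport --use-threads=2 --local mydb /tmp/table1.txt /tmp/table2.txt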

As I mentioned yesterday, the general and slow query logs can now be system tables. This is cool! It means switching on the general query log is as simple as set global general_log=1;. Before 5.1, switching the general query log on and off required a database restart! No more staying up late or working weekends just to switch the general query log on and off for a few hours. 🙂

Here is what is recorded in the logs:

mysql> desc slow_log;
+----------------+--------------+------+-----+-------------------+-------+
| Field          | Type         | Null | Key | Default           | Extra |
+----------------+--------------+------+-----+-------------------+-------+
| start_time     | timestamp    | NO   |     | CURRENT_TIMESTAMP |       | 
| user_host      | mediumtext   | NO   |     |                   |       | 
| query_time     | time         | NO   |     |                   |       | 
| lock_time      | time         | NO   |     |                   |       | 
| rows_sent      | int(11)      | NO   |     |                   |       | 
| rows_examined  | int(11)      | NO   |     |                   |       | 
| db             | varchar(512) | YES  |     | NULL              |       | 
| last_insert_id | int(11)      | YES  |     | NULL              |       | 
| insert_id      | int(11)      | YES  |     | NULL              |       | 
| server_id      | int(11)      | YES  |     | NULL              |       | 
| sql_text       | mediumtext   | NO   |     |                   |       | 
+----------------+--------------+------+-----+-------------------+-------+
11 rows in set (0.00 sec)

Much more readable than the current slow query log format. What’s even cooler is that these tables use the CSV engine, which is both human- and MS Excel-readable, so you can either tail the table’s file or load it straight into Excel. You could also use the event scheduler to truncate the log tables every day.
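
For example, pulling out the ten slowest queries becomes a plain SELECT, and the daily cleanup event would just be a TRUNCATE:

SELECT start_time, query_time, rows_examined, sql_text
FROM mysql.slow_log
ORDER BY query_time DESC
LIMIT 10;

TRUNCATE TABLE mysql.slow_log;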

mysqlslap is a new command-line load-testing tool that can simulate many concurrent connections to the database and repeatedly run load tests. It is kind of like Apache’s ab, but for databases.

It has a few cool features like autogeneration of SQL and can sequentially test different engines. For example you could pass a list of engines like this:

--engine=innodb,falcon,myisam

and compare the performance across the different storage engines.
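
A hypothetical run, letting mysqlslap generate its own schema and queries:

mysqlslap --auto-generate-sql --concurrency=50 --iterations=10 \
          --engine=innodb,myisam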

Because this is a MySQL client utility you can run it against any version of MySQL >= 3.23.

There are a few new storage engines for 5.1, including PBXT, an engine that promises the speed of MyISAM with transactions (although it is not fully ACID). It uses 5.1’s new plugin architecture, which means you can drop the engine straight into a running database. There is also a crazy HTTP storage engine implemented with GETs and PUTs, and the Amazon S3 storage engine, on which Mark Atwood is giving a session on Wednesday.

For the second session, I went to hear Domas Mituzas talk about what goes on behind the scenes at Wikipedia. Domas was a funny guy and it was interesting to hear how Wikipedia operates with mostly volunteer sysadmins and cheap commodity hardware.

They use a lot of custom code to handle things like database load balancing and caching. They use PowerDNS to handle geo-distribution.

A lot of effort goes into caching; not just ensuring as much is cached as possible but that the cache can easily be invalidated when an object is updated. Basically, Wikipedia’s policy is to cache as much as possible inside their network but once it leaves the Wikipedia network the private cache-control bit is set so clients will always request a new copy of the data.

Wikipedia do about 30,000 requests per second so logging to files is out of the question. They send their logs to the network where anyone that is interested in them can pick them up.

Some of the customisations to the standard LAMP stack are:

  • PHP’s strtr() function has been replaced with FastStringSearch
  • ImageMagick is used for thumbnailing rather than PHP’s GD
  • They use APC as a bytecode cache; Domas thinks eAccelerator is too buggy
  • There is a TeX filter, written in OCaml, that renders formulae and other scientific content – and no one knows how to change it!

Wikipedia has real-time profiling written into their code. They also use xdebug combined with KCachegrind, which creates some nice output showing a tree of calls and execution times. This could be useful to us right now, as we are working on profiling our own code.

All their database servers run RAID0 – they expect servers to fail, so they put their redundancy into the servers themselves rather than the disks. It is easier for them to promote a slave database to master than to run a server on a degraded RAID array or wait for a database to recover after a crash.

The text of Wikipedia articles is not actually stored in the database tables. The tables only contain metadata which points to external storage (more cheap disks) where the text of articles is stored (in compressed format).

Many of the Wikipedia tables carry more index than data, reflecting the high number of reads compared to writes.

It was also interesting to hear about the different culture of Wikipedia. It is mostly run by volunteers on cheap hardware and their attitude to uptime is what you might call relaxed, but it seems to work for them. As Domas said, they don’t mind too much if they have a little downtime, because when Wikipedia goes down it displays the “please donate” screen. 🙂

Santa Clara


I have arrived in Silicon Valley where I will be spending the next week at the MySQL Conference & Expo 2007. I’m pretty excited about it.

It turns out California is quite a long way from London. After a ten-hour flight I caught the big double-decker Caltrain south along the Bay to Mountain View, where I transferred to the Valley Transportation Authority light rail. I was soon rolling past the offices of Yahoo, NetApp and Lucent, the huge hangars and radar dishes of NASA Ames and Lockheed Martin, and the deliciously named Chubby’s Broiler.

The train carried on down W Java Dr and it wasn’t long before I reached the Santa Clara Convention Center and my hotel (the Hilton) directly opposite. My room is pretty nice and from my window I can see the roller coasters of the Great America theme park.

Now I just need to plan what sessions to attend but looking at the schedule I have the usual problem of there only being one of me. Tomorrow is tutorial day and although my registration is sessions-only at the moment, apparently I can upgrade it to the tutorials tomorrow. Unfortunately Scaling and High Availability Architectures is sold out so I might go along to hear about all the new stuff in MySQL 5.1 in the morning.

5.1 has some cool new features like partitioning (which could allow us to eliminate tables split by date in our application) and logs on demand. In fact there is another session just on the new logging capabilities at the conference. Check out this post for some examples of what you can do when the logs are stored in the database itself.

Learning how to write your own storage engine sounds interesting but not something I see myself doing any time soon, so I reckon I’ll see how Wikipedia operates in the afternoon.