Wednesday, 5 July 2017

Data Retention and Percona Archiver

Data retention can be a bit touchy but when the alternative is to let tables grow by GB per week or per day, sometimes you just got to pick an upper limit. In my experience, suggesting something to stakeholders helps to get things rolling.

Magically I've recently "discovered" the Percona Archiver - I've been rolling my own for far too long. This tool is well documented and I shan't repeat the documentation other than to give an example along with some tidbits.

The archiver can move records to a destination table (the --dest option) OR to a file (the --file option). Both are useful and I'll show the file one because that's The Final Solution other than outright launching the nukes with --purge. Give a Select criteria (the --where option) and consider to include table maintenance (--optimize) if you are moving a lot of data.

For clarity: pt-archiver does a DELETE for each record it archives. 

# dump table from N months ago
DELYRMO=`date --date "$DELAGE months ago" +%Y%m`

# do not (!) overwrite file with something (-s) in it already
if [ ! -s "$BAKFILE" ] ; then
        pt-archiver --source h=localhost,D=$DB,t=$table --file archive-$table.txt --where "calib_aimextractor_id > 0" --optimize s --statistics
        echo "$BAKFILE has something in it, dump has been SKIPPED"
        exit -1
This is a drastically simplified script from what I used to do.

  1. Set the data retention which in this case is 6 months. The "date" command is useful for generating dates or parts thereof like the year, month, day, week whatever you need for both file names and search criteria. 
  2. File target should be some file system location locally or NFS. The file format is suitable for LOAD DATA INFILE
    • Gotcha! Loading data files is a risky thing to do and disabled by default in MySQL. Typically load the data to a non-production server, then manually extract the relevant records and insert them back into prod.
  3. Sanity check you're not stomping a file that's already there. I prefer to be safer than sorrier.
  4. Credentials should be in .my.cnf 
    • Seems obvious when you know to do it, but don't put user creds in scripts, dumbo! I did that too often :(
  5. Gotcha! If using --dest table instead of a file target, specify the host (h) and database (D) because otherwise pt-archiver makes some assumptions which may be very wrong
  6. Optimize your source (s) especially if a large number of rows are being pulled. Consider to also use destination (d) 
There's lots more guidance in the documentation and from other users Online. Some like to process larger numbers of records concurrently like with --limit and --bulk-delete, but the defaults (1 record) have been good to me as this runs relatively fast. Likewise there's options to check your slaves don't get far out of sync which again default behaviour is fast enough for me, but there's lots of powerful options to tune pt-archiver.

Take backups, test, test, test and you shouldn't need Good Luck :)

Popular Posts