Dataherding
Brian Luft <brian@lincolnloop.com>

DjangoCon 2010, Portland, OR
Enough About Me
Working with Django for a few years.

With Lincoln Loop since 2008.

Clients include National Geographic, PBS, Nasuni, and redbeacon.
The Big Squeeze
Companies and organizations that have been operating for at least a few years have amassed
large amounts of content and data.

For most organizations that also means systems and tools that have been around for a while:

 • Desktop applications

 • Expensive, licensed, proprietary applications

 • Mounting frustration as staff can't operate the way they'd like to
Agenda
Dealing With Data in a Django Environment
• Front matter: A Few Tools and Tips

• Dealing With a Large Legacy Migration

• South in Team Environments
Django Data-Oriented Commands
 • dbshell

 • dumpdata / loaddata

 • inspectdb

 • flush / reset

 • sql, sqlall, sqlclear, sqlcustom, sqlflush, sqlindexes, sqlreset, sqlsequencereset

--database option
Fixtures
Natural Keys - new feature in Django 1.2

Helps make fixtures much more portable.
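A model opts in by defining a natural_key() method plus a manager with get_by_natural_key(); dumpdata --natural then serializes references by those fields instead of by primary key. A minimal sketch, using a hypothetical Person model:

from django.db import models

class PersonManager(models.Manager):
    def get_by_natural_key(self, first_name, last_name):
        # loaddata uses this to resolve references without relying on pks
        return self.get(first_name=first_name, last_name=last_name)

class Person(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    objects = PersonManager()

    class Meta:
        unique_together = (('first_name', 'last_name'),)

    def natural_key(self):
        # dumpdata --natural serializes this tuple instead of the pk
        return (self.first_name, self.last_name)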
Customizing Data Installation
 • post-syncdb

 • sqlcustom

    ◦ appname/sql/modelname.sql

Example:

 • More powerful syncdb

    ◦ Handle migrations, sync, and other fixture installations all in one shot (see the post-syncdb sketch below)
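For instance, a post-syncdb handler can install seed data the moment an app's tables exist. A hedged sketch, assuming the apps.content application that appears later in this talk:

from django.db.models import signals
from apps.content import models as content_app

def install_initial_data(sender, app, created_models, verbosity, **kwargs):
    # Runs after syncdb creates tables for the content app
    if content_app.Section in created_models:
        content_app.Section.objects.get_or_create(name='News')

signals.post_syncdb.connect(install_initial_data, sender=content_app)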
IPython Tips
1. shell_plus

2. Input/Output Caches

3. Macros
shell_plus
Part of django-extensions - autoloads your Django models into the interpreter namespace.

python bin/manage.py shell_plus
From 'auth' autoload: Permission, Group, User, Message
From 'contenttypes' autoload: ContentType
From 'sessions' autoload: Session
From 'sites' autoload: Site
From 'admin' autoload: LogEntry
From 'redirects' autoload: Redirect
From 'south' autoload: MigrationHistory
From 'categories' autoload: Category
From 'content' autoload: Page, Content, Guide, WikiName, WikiLink, Template, Attraction
From 'menus' autoload: Menu, MenuItemGroup, MenuItem
Input Cache
The input cache lets you access and re-run previously entered commands in a flexible manner.

In [5]: _i3
Output Cache
The output cache lets you access the results of previous statements.

In [23]: range(5)
Out[23]: [0, 1, 2, 3, 4]
...
In [30]: _23
Out[30]: [0, 1, 2, 3, 4]
Large Output and the Output Cache
You can suppress output using a ';' at the end of a line.

The output cache prevents Python's garbage collector from reclaiming previous results, which can quickly use up memory. Use the cache_size setting to raise or lower the cache size (0 disables it).
Macros
You can easily use the %macro feature to capture previous input lines into a single command.
%hist is handy for this.

You can also %store your macros so they are available in the IPython namespace across sessions.
Macros as Workflow Commands
Ever seen this?

SQL Error: 1452: Cannot add or update a child row:
a foreign key constraint fails (`myapp`.`categories`,
CONSTRAINT `fk_categories_categories`
FOREIGN KEY (`parent_id`) REFERENCES `categories` (`id`)
ON DELETE NO ACTION ON UPDATE NO ACTION)
Macros as Workflow Commands
Mix shell commands and Python:

In [1]: !echo "set foreign_key_checks=0; drop table customers" | manage.py dbshell
In [2]: Accounts.objects.all().delete()
In [3]: import sys
In [4]: print "I'm getting distracted"
In [5]: reload(sys.modules['apps.tickets'])
In [6]: %macro resetaccts 1-2 5
In [7]: %store resetaccts
...
In [45]: resetaccts
In [46]: migrate
Agenda
Dealing With Data in a Django Environment
• Front matter: A Few Tools and Tips

• Dealing With a Large Legacy Migration

• South in Team Environments
Legacy Data Migration
Moving data from a big ol' crufty database to your shiny new Django application can be a wedge in
agile development processes.
Pre-requisites
• Production snapshot or reasonable imitation (getting this is often a PITA)

• People who know the legacy schema and all the convoluted business requirements
Get a Bird's Eye View
Not all data is created alike.

 • Table inventory

 • Table data types

 • Tables with several relations
Getting Started
• inspectdb

• ORM Multi-DB support

• Create an app dedicated to the legacy schema

• Give your project a migration role (settings file dedicated to the migration)
Migration Project Role
# migration_settings.py
from settings import *

DATABASES = {}
DATABASES['default'] = {
    'NAME': 'newscompany',
    'ENGINE': 'django.db.backends.mysql',
    'USER': 'joe',
    'PASSWORD': 'schmoe',
}

DATABASES['legacy'] = {
    'NAME': 'newscompany_legacy',
    'ENGINE': 'django.db.backends.mysql',
    'USER': 'root',
    'PASSWORD': 'youllneverguess',
}
Migration Project Role (continued)
DATABASE_ROUTERS = ['apps.legacy.db_router.LegacyRouter',]

INSTALLED_APPS += (
    'apps.legacy',
)

DEBUG = False
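The talk doesn't show the router's source, so this is an assumption: a plausible minimal LegacyRouter that pins every model in the legacy app to the 'legacy' connection.

# apps/legacy/db_router.py
class LegacyRouter(object):
    """Send all reads and writes for the legacy app to the 'legacy' DB."""

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'legacy':
            return 'legacy'
        return None

    def db_for_write(self, model, **hints):
        if model._meta.app_label == 'legacy':
            return 'legacy'
        return None

    def allow_syncdb(self, db, model):
        # Never create legacy tables via syncdb; they already exist
        if model._meta.app_label == 'legacy':
            return False
        return db == 'default'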
Debug False?
Wait...

DEBUG = False

??? Yes, really: with DEBUG = True, Django appends every query it executes to connection.queries, so a long-running migration will steadily eat memory until it falls over.
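If you must run part of a job with DEBUG = True, clear the query log by hand. A minimal sketch, using the legacy Articles queryset from later in this talk:

from django import db

for i, a in enumerate(legacy_models.News_Article.objects.iterator()):
    ...  # migrate the row
    if i % 1000 == 0:
        db.reset_queries()  # drop the accumulated connection.queries list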
What about South?
"I wouldn't use South for this type of project." - Andrew Godwin, author of South
Let's Write a Migration!
We'd like to start pulling legacy Articles into the new application. Running inspectdb has given us
this:

# apps/legacy/models.py

class News_Section(models.Model):
    name = models.CharField(max_length=50)
    ...

class News_Topic(models.Model):
    title = models.CharField(max_length=50)
    ...

class News_Article(models.Model):
    section = models.ForeignKey(News_Section)
    topic = models.ForeignKey(News_Topic)
    ...
Our New Schema
We've decided that the legacy schema works pretty well here, so we'll only make minor modifications:

# apps/content/models.py

class Section(models.Model):
    name = models.CharField(max_length=50)
    ...

class Topic(models.Model):
    name = models.CharField(max_length=50)
    ...

class Article(models.Model):
    section = models.ForeignKey(Section)
    topic = models.ForeignKey(Topic)
    ...
Mapping Table-to-Table
First we'll migrate Sections

# apps/legacy/migrations/sections.py
from apps.legacy import models as legacy_models
from apps.content.models import Section

for s in legacy_models.News_Section.objects.all():
    section = Section(name=s.name)
    section.save()
Another Table-to-Table
Next we handle Topics:

# apps/legacy/migrations/topics.py
from apps.legacy import models as legacy_models
from apps.content.models import Topic

for t in legacy_models.News_Topic.objects.all():
    topic = Topic()

    # map to the new field name
    topic.name = t.title

    topic.save()

We're on a roll now!
Generalizing Table-to-Table
For these types of one-to-one mappings it is easy to generalize the operation and reduce a bunch
of repetitive code. (Declarative code FTW!)

def map_table(field_map, src, dst):
    for src_col, dst_col in field_map:
        setattr(dst, dst_col, getattr(src, src_col))
    return dst

MAP = (('name', 'name'),
       ('title', 'headline'),
       ('create_date', 'date_created'))

for some_old_object in legacy_models.SomeOldModel.objects.all():
    map_table(MAP, some_old_object, ShinyNewObject()).save()
Migrating Tables with Relations
Now, time for the Article.

In this case we have foreign keys to fill so our table-mapping pattern won't get us all the way
there. We'll need to account for the relations manually.

for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    Section.objects.get(...)  # OOPS! How do we know which one?

A snag. We need to look up the Section in the new database that corresponds to the old Article's News_Section. How can we reliably tell which one it is?
Preserve a Few Legacy Bits
We need to be able to tell which row in the legacy DB an object in the new system came from. In addition to keeping the original ID, you may want to preserve other fields even if you don't have a definite plan for them. Reasons to do so:

 • Reports and historical documents might reference legacy IDs

 • Affiliate APIs, third parties, and business partners might have legacy IDs in their systems

 • New business rules deprecate some data, but the data might still be useful in the future

I'm in favor of sticking these directly on the models, unless it is more than a few extra fields.
Introducing Models with Baggage
Our new Section, Topic, and Article models:

# apps/content/models.py

class Section(models.Model):
    legacy_id = models.IntegerField()
    ...

class Topic(models.Model):
    legacy_id = models.IntegerField()
    ...

class Article(models.Model):
    section = models.ForeignKey(Section)
    topic = models.ForeignKey(Topic)
    legacy_id = models.IntegerField()
    ...
Articles Migration: Take 2
for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    # get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic

    article.save()

OK, so we're getting close. Now we just rinse and repeat with the rest of the tables in the DB.
Reality Check
Turns out the legacy system contains 75,000 Articles.

By the way, we haven't put much attention into the 3,000,000 user comments, the 700,000 user
accounts, the user activity stream, the media assets, and a few other things.

Also, the Articles mapping is going to need work because there are different article "types" that
were shoehorned into the system.
What Are We Up Against?
Let's look at our Article migration again:

for a in legacy_models.News_Article.objects.all():
    ...

How big of a QuerySet can we actually handle? Beyond a few thousand objects things might get
dicey. (Don't try this at home)
What Are We Up Against?
We can switch to using ModelManager.iterator():

for a in legacy_models.News_Article.objects.iterator():
    ...

Memory crisis averted! (Not really...)
What Are We Up Against?
But wait, now we're making at least one query per News_Article:

for a in legacy_models.News_Article.objects.iterator():
    ...

75,000 article queries
What Are We Up Against?
for a in legacy_models.News_Article.objects.iterator():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

75,000 section queries
What Are We Up Against?
for a in legacy_models.News_Article.objects.iterator():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    # get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic

    article.save()

75,000 topic queries
Survey Says...
 • 1 Huge News_Article query

 • 75,000 Section queries

 • 75,000 Topic queries

 • 150,000 total queries

At 5 ms connection latency, 150,000 queries = 750 seconds = 12.5 minutes spent on network latency alone.
Now What?
1. ModelManager.all() is problematic once row counts get into the thousands.

2. ModelManager.iterator() might not be enough.

Neither option is very attractive as we deal with large tables.
Let's Write a Smarter Migration
What should guide our decision making?

 • Step out of the web-request mindset. We're optimizing for different needs here.

 • Want to maximize Write throughput to the application DB

 • Want to maximize Read throughput from the legacy DB

 • Cram the RAM

 • Maximize Objects per query

 • Fill the connection packets

 • Migration jobs as atomic units of work
KISS
 • Database performance features (delayed inserts, etc.)

 • Database import/export data formats (Postgres COPY, MySQL LOAD DATA INFILE)

Don't worry about getting too exotic until you've maxed out other options. A well-designed job system will give you a ton of mileage.
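If you do graduate to the bulk loaders, they're one raw cursor call away. A hedged sketch for MySQL's LOAD DATA INFILE, assuming a pre-built CSV and the article table from this talk's examples:

from django.db import connection

def load_articles_csv(path):
    cursor = connection.cursor()
    cursor.execute("""
        LOAD DATA LOCAL INFILE %s
        INTO TABLE content_article
        FIELDS TERMINATED BY ',' ENCLOSED BY '"'
        LINES TERMINATED BY '\\n'
        (headline, section_id, topic_id, legacy_id)
    """, [path])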
Difference Makers
Don't Work Blind. Make sure you know how to:

 • View the query log

 • Profile queries, measure throughput

Also helps:

 • Disable indexes on the new database until after the migration is done

 • Turn on connection compression if your client/server support it and you're going over the wire.
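A tiny throughput harness makes the measuring part painless; a sketch in plain Python (the job call at the end is hypothetical):

import time
from contextlib import contextmanager

@contextmanager
def throughput(label, count):
    # Wrap a batch of work and report rows/sec so tuning is measurable
    start = time.time()
    yield
    elapsed = time.time() - start
    print '%s: %d rows in %.1fs (%.0f rows/sec)' % (
        label, count, elapsed, count / elapsed)

with throughput('sections', 1000):
    run_sections_batch()  # hypothetical: migrate one 1,000-row batch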
Bring Friends
Grab some handy tools:

 • SQL Editor, GUI Console (if you're not a CLI ninja)

 • Maatkit

 • SQLAlchemy
A Few Key Features
1. Pause / Graceful Stop

2. Resume

3. Timing

4. Logging

5. Partial Jobs

6. Strict Mode vs. Continue on Fail
Graceful Stop
The ability to cancel the process and leave data in a consistent state.
Resume
The ability to restart the process from a specific point.
Timing
The ability to record how long a job takes.
Logging
The ability to record what was done and what went wrong.
Partial Jobs
The ability to run a job against a single row, a range of rows, or a single table.
Strict Mode
The ability to have the migration either ignore errors (logging them, of course) or stop on any exception.
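Put together, a runner with these features stays small. A hedged skeleton; the job interface (batches(), insert_batch()) is an assumption, not the talk's actual code:

import logging
import time

log = logging.getLogger('migration')

def run_job(job, resume_from=0, strict=True):
    last_id = resume_from  # resume: start past the last row already done
    started = time.time()
    for batch in job.batches(after_id=last_id):  # assumed job interface
        try:
            job.insert_batch(batch)
        except Exception, e:
            log.error('batch after id %s failed: %s', last_id, e)
            if strict:
                raise  # graceful stop; run again with resume_from=last_id
        last_id = batch[-1].pk
        log.info('completed through legacy id %s', last_id)
    log.info('job finished in %.1fs', time.time() - started)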
Articles Migration: Take 3
One more look at our naive first stab at it:

for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    # get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic

    article.save()

Problems:

 • The legacy articles query is pulling one at a time

 • Even when we fix that, we can still only deal with one row at a time because we need to query
   the Section and Topic per old Article.

 • Not easy to make this work in parallel.
Work in Batches
We can strike a balance between the all-at-once and row-at-a-time extremes of all() and iterator() by working in fixed-size batches.

Start with a guess at a reasonable batch size; 1,000 rows is a sensible starting point for most situations.
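A minimal batching sketch, assuming an auto-increment primary key on the legacy table:

BATCH_SIZE = 1000

def batches(queryset, size=BATCH_SIZE):
    last_pk = 0
    while True:
        batch = list(queryset.filter(pk__gt=last_pk).order_by('pk')[:size])
        if not batch:
            return
        yield batch
        last_pk = batch[-1].pk

# One query per 1,000 legacy rows instead of one huge query or one per row
for batch in batches(legacy_models.News_Article.objects.all()):
    ...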
A More Declarative Style
We can move the work of mapping rows out to a runner script.

from apps.content.models import Topic
from legacy.migration.runner import MigrationBase
from apps.legacy.models import News_Topic

class Migration(MigrationBase):
    model = Topic
    legacy_model = News_Topic

    #       legacy   application
    MAP = (('title', 'name'),)

This lets us vary the batch size and resource usage independently of the individual jobs.
Non-trivial Row Transformations
from apps.content.models import Topic
from legacy.migration.runner import MigrationBase
from apps.legacy.models import News_Topic

class Migration(MigrationBase):
    model = Topic
    legacy_model = News_Topic

    #       legacy   application
    MAP = (('title', 'name'),)

    def process_row(self, row):
        return (row.title.upper(),)
Job Objects
Generalize construction of multi-value INSERT statements:

class MigrationBase(object):

    @property
    def column_list(self):
        return ','.join(self.MAP)

    @property
    def values_placeholder(self):
        # DEFAULT fills the auto-increment pk; one %s per mapped column
        return 'DEFAULT,' + ','.join(['%s'] * len(self.MAP))

    @property
    def insert_stmt(self):
        return "INSERT INTO %s VALUES (%s)" % (self.model._meta.db_table,
                                               self.values_placeholder)
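A hedged usage sketch to pair with it, issuing one multi-row INSERT per batch (transaction handling as in Django 1.2):

from django.db import connection, transaction

def insert_batch(self, rows):
    # rows: a list of value tuples produced by process_row()
    cursor = connection.cursor()
    cursor.executemany(self.insert_stmt, rows)
    transaction.commit_unless_managed()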
Handling Related Tables
A little extra work since we need to collect the legacy IDs.

Remember how we needed to look up the related model in the new database using the legacy ID?

for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    Section.objects.get(...)  # OOPS! How do we know which one?
Related Tables Pattern
class Migration(MigrationBase):
    model = Article
    legacy_model = News_Article
    related = [{'model': Section,
                'map_index': 1}]

    MAP = ('title',
           'section_id',
           'legacy_id')

    def process_row(self, row):
        return {'values': [row.title,
                           None,  # section_id: filled in later, in bulk
                           row.id],
                'Section': row.section_id}

Now we can grab the related objects in a batch (one query), apply the correct new IDs in bulk, and
preserve our batch INSERT for the new objects.
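The bulk lookup itself costs one query per related model per batch. A sketch of the idea (the helper name is hypothetical):

def resolve_related(self, rows):
    # One query: legacy Section id -> new Section id for the whole batch
    legacy_ids = [r['Section'] for r in rows]
    id_map = dict(Section.objects
                         .filter(legacy_id__in=legacy_ids)
                         .values_list('legacy_id', 'id'))
    for r in rows:
        # map_index told us slot 1 of 'values' holds section_id
        r['values'][1] = id_map[r['Section']]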
Progress Report
In a real-world example, a job that was taking a few minutes was reduced to less than a second.
Run in Parallel
The advantage of the atomic-style jobs is that they can run independently.

This means we can use Queue from multiprocessing and run jobs in parallel.

 • A little added complexity: we need to make sure each table's related tables have finished before its own jobs run (see the sketch below).
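A hedged sketch of the wave-ordered queue; the job class names are hypothetical:

from multiprocessing import Process, JoinableQueue

def worker(queue):
    while True:
        job = queue.get()
        try:
            job.run()
        finally:
            queue.task_done()

queue = JoinableQueue()
for i in range(4):
    p = Process(target=worker, args=(queue,))
    p.daemon = True
    p.start()

# Wave 1: tables with no FK dependencies
for job in [SectionsJob(), TopicsJob()]:
    queue.put(job)
queue.join()

# Wave 2: Articles need Sections and Topics in place first
queue.put(ArticlesJob())
queue.join()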
No Mercy Write Throughput
If you need big league performance, replace the local Queue with Celery.

Now we can also run jobs on multiple network nodes and even use multiple copies of the legacy
DB for improved read throughput.

We can also write to multiple application DBs for increased write throughput. Merge them at the
end.

Use cloud servers (EC2 / Rackspace)
In Case You're Still Bored
Other lovely things you'll run into:

 • Weird primary keys

 • Data inconsistencies

 • Creative user solutions to limitations in the old schema

 • All kinds of special conditions and edge cases

 • Mismatched data types, abused data types
Outside the Confines of Tidy Talk
Don't get stuck in the Django ORM tunnel. This is a very appropriate domain for using alternative
approaches.

 • ModelManager.raw()

 • Using the cursor will let you write more elegant JOIN queries

 • SQLAlchemy / Unit of Work
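For example, one raw JOIN over the legacy connection replaces the per-row Section and Topic lookups entirely. A hedged sketch; the legacy table and column names are assumptions:

from django.db import connections

last_id = 0
cursor = connections['legacy'].cursor()
cursor.execute("""
    SELECT a.id, a.headline, s.name, t.title
    FROM news_article a
    JOIN news_section s ON s.id = a.section_id
    JOIN news_topic t ON t.id = a.topic_id
    WHERE a.id > %s
    ORDER BY a.id
    LIMIT 1000""", [last_id])
for legacy_id, headline, section_name, topic_title in cursor.fetchall():
    ...  # build new rows for a batch INSERT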
Agenda
Dealing With Data in a Django Environment
• Front matter: A Few Tools and Tips

• Dealing With a Large Legacy Migration

• South in Team Environments
Managing Migrations
• South has good momentum and good intentions

• Does a job well and gets out of your way

• Best if everyone has a decent understanding of how it works
Common Complaints
1. Merging branches brings migration conflicts

2. Two team members create identically numbered migrations

Turns out the solution is to Talk To Your Teammates!
Other Solutions
ChronicDB is a new product with an innovative approach to schema migrations.

Built in Python and C. A free version is available for small databases.
Questions?
Brian Luft - @unbracketed @zen_of_python
Thank you for your attention.