[PonyORM-list] Inheritance

Alexander Kozlovsky alexander.kozlovsky at gmail.com
Fri Sep 26 16:37:42 UTC 2014


Hi Jake!

At first I thought: "This is the great idea, we definitely should add such
index". But then I start to analyze for which queries the index specified
on the discriminator column (that is, on the `classtype` column) can
improve performance. And it turns out that for most queries this index
would not be useful: either because full table scan would be cheaper, or
because the table already has some better index.

Let's consider some straw-man example to illustrate this. Let's say we have
three classes: Person, Student and Teacher, and both Student and Teacher
classes are inherited from the Person class:

    class Person(db.Entity):
        id = PrimaryKey(int, auto=True)
        name = Required(unicode)
        age = Required(int)

    class Student(Person):
        group = Required(Group)
        gpa = Required(float)

    class Teacher(Person):
        dept = Required(Department)
        degree = Required(unicode)

I don't describe Group and Department classes here, because they are
irrelevant to the discussion.

To have some specific numbers, let's say we have a database which consist
of 100 000 student and 1000 teachers. Each physical database page holds 50
records of persons, and the record of Student have the same size as the
record of Teacher. These numbers are simplification, and the goal is not to
be entire realistic, but to get some general estimations. With these
numbers, the database will have 2020 pages of data. The records of students
and teachers will be intermixed in the same table. If teacher records were
distributed completely evenly, then about 1000 pages will have teacher
records, one teacher record on each page. But let's say the teacher records
are stored more compactly, and reside in 100 database pages. Each of this
pages holds 10 teacher records and 40 student records.

So, we have:
- 2020 pages of data;
- 50 records per page;
- Each page holds student records, and some pages holds teacher records
also;
- 100 pages with teacher records (10 teacher records and 40 student records
per page).

####################################################################

Now, I want to list some example queries, for which we can analyze if the
index on the discriminator column will give any benefits:

1) Access by primary key

    s = Student[123]
    t = Teacher[456]

2) Retrieve all instances of the specific class:

    select(s for s in Student)
    select(t for t in Teacher)

3) Retrieve instances of specific class with using some additional criteria:

    select(s for s in Student if s.group.number == 101)
    select(s for s in Student if s.gpa > 4)
    select(s for s in Student if s.group.number == 101 and s.gpa > 4)
    select(t for t in Teacher if t.dept.number == 2)
    select(t for t in Teacher if t.degree == "Professor" and t.age > 60)

4) Aggregations

    # max gpa for each group of students
    select((s.group.number, max(s.gpa)) for s in Student)

    # average professor age for each department
    select((t.dept.number, avg(t.age)) for t in Teacher if t.degree ==
"Professor")

Now, let's analyze which queries may get speedup from the additional index
on the discriminator column.

####################################################################

1) Access by primary key. This type of queries will not be benefited from
new index, because access by PK is already the most efficient.

2) Retrieve all instances of specific class

    select(s for s in Student)

The student records are placed in all data pages. The most efficient way to
retrieve all student is to perform full table scan. New index on the
discriminator column will not speedup this query.

    select(t for t in Teacher)

There are two ways to perform this query: (1) full table scan and (2)
index-based access using the index on the discriminator column. The full
table scan uses sequential access to the disk, and the index-based query
uses random access. For traditional drives with rotated disks the random
access is around 100 times slower then the sequential access. Because of
this, in order to obtain benefits from the index-based scan, the pages with
the teacher's records should represent less then 1% of all pages. In our
case, the teacher records spread across 10% of all pages, so full table
scan will be much faster then index-based query.

For SSD drives the difference between sequential and random access is not
so big, the random access may be just 5-10 time slower then the sequential
access. Hence, for SSD drives, index-based access starts to be more
efficient when teacher records placed in less then 10-20% of the pages. In
our example, teacher records placed in 10% of all pages, so, when using SSD
drive, the index-based query may be a bit more efficient then the
full-table scan.

3) Retrieve instances of specific class with using some additional criteria:

    select(s for s in Student if s.group.number == 101)

The new index on discriminator column will not give any benefits here,
because we already have much better index based on the `group_id` column.

    select(s for s in Student if s.gpa > 4)

The most efficient way to do this query is to perform full-table scan. The
index on the discriminator column will not be useful here.

    select(s for s in Student if s.group.number == 101 and s.gpa > 4)

The existing index on `group_id` column will work very good for this query,
and the index based on the discriminator column will be slower then the
full-table scan.

    select(t for t in Teacher if t.dept.number == 2)

The most efficient way to perform this query is to use existing index on
the `dept_id` column.

    select(t for t in Teacher if t.degree == "Professor" and t.age > 60)

The most efficient way to perform this query is to use specialized
composite index based on columns `degree` and `age`. Without this index,
the most efficient way for HDD drive will be the full-table scan, and for
SSD the index based on the discriminator column may be a bit better then
the full-table scan.

4) Aggregations

    # max gpa for each group of students
    select((s.group.number, max(s.gpa)) for s in Student)

The most efficient way is to use composite index based on columns
`group_id` and `gpa`. This way the query may be performed as index-only
scan. The next best way is to use existing index based on the `group_id`
column.

    # average professor age for each department
    select((t.dept.number, avg(t.age)) for t in Teacher if t.degree ==
"Professor")

The most efficient way is to use composite index based on columns `dept_id`
and `age`. This way the query may be performed as index-only scan. The next
best way is to use existing index based on the `dept_id` column.

####################################################################

To summarize: this is the queries which *may* perform marginally better on
SSD drives when the table have the index based on the discriminator column:

1) All instances of the specific subclass which have small total number of
instances:

    select(t for t in Teacher)

2) Filtering the subclass with small number of instances using the criteria
based on some non-indexed columns:

    select(t for t in Teacher if t.degree == "Professor" and t.age > 60)

But the additional index on discriminator column will benefit this
queries only if records of this specific class are placed in less then 1%
of total number of pages (for HDD drives), or less then 10% of total number
of pages (for SSD drives).

All other analyzed queries will not benefit from the suggested index,
either because full-table scan works better, or because the table already
have much better index.

Looking on these numbers, I don't think that we should automatically index
discriminator column - the benefit of this is just not too big, and
maintaining additional index will slow down inserts. Anyway, if somebody
want to add index based on the discriminator column it is relatively easy
to do, you just need to define discriminator column explicitly and make it
indexed. It may be done this way:

    class Person(db.Entity):
        id = PrimaryKey(int, auto=True)
        classtype = Discriminator(str, index="classtype_idx")
        name = Required(unicode)
        age = Required(int)


Regards,
Alexander Kozlovsky


On Tue, Sep 23, 2014 at 4:28 AM, Jake Austwick <jake.austwick at gmail.com>
wrote:

>  When you inherit one model from another within Pony, it uses the method
> of just having one table and storing the class type in the `class type`
> column. Would it not be sensible for this column to automatically be
> indexed? It's clearly going to be used in queries to separate out the
> different types of classes, otherwise inheritance wouldn't have been used
> in the first place.
>
> Thoughts?
>
> Thanks,
> ------------------
> Jake
> http://jakeaustwick.me/
>
> _______________________________________________
> ponyorm-list mailing list
> ponyorm-list at ponyorm.com
> /ponyorm-list
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: </ponyorm-list/attachments/20140926/ce80f3de/attachment.html>


More information about the ponyorm-list mailing list