banner image 1 banner image 2

Understanding Partitions in MySQL

February 3, 2022
9 mins
command
blog-img 1
Sai Viswanathan
Author

How to manage huge datasets and large inflow of data

By Sai Viswanathan, who doesn’t just breath air, he breathes programming.


Introduction

Imagine a platform like Instagram/Facebook/Amazon with millions of users and tons of day-to-day activities done by them. How data flows in such platforms? How do they manage their huge datasets? How data will be persisted, accessed and distributed across servers? These are the real-world problems and every business firms solves these problems in their own suitable way with the help of various database techniques such as Master-Slave db, Distributed data access, Sharding, Partitioning etc., Some usecases needs intense caching, some needs realtime data so caching won’t work all the time. Some usecases needs powerful read operation, some needs write operations and both at times. Based on the problem that you are facing and what you are trying to achieve that decides what technique you should follow.

This article has been published by Instagram on 2012 where they explained about the problems that they face while handling huge datasets, high resolution images and explained how they resolved it using Sharding technique. Similarly, any business firm that deals with customer data and their frequent day-to-day actions (like placing an order on E-commerce application, sharing a post/image on Social Media) will have to go through a journey in handling and efficiently maintaining huge dataset. CaratLane is no exception to it as we are getting more and more orders everyday, we are maintaining huge dataset such as order related data, Item related data, data that are related to minor components in an item/SKU. Since we also store some portion of the ERP data, you can imagine the volume of data that we deal with everyday.

We are facing a problem where some of our table sizes are growing huge and such instances will occur more in future as our data flow increases. When you talk about table size, there are various factors that decides it such as Operating system restrictions, limits applied by storage engine but no hard-limit kept by MySQL itself.

Why?

When you query such huge dataset, entire data has to be kept in memory. Even with indexes, it only creates overhead as B-Tree indexes won’t work well for larger tables and cost of maintaning indexes is also very high especially when you have more indexes on your table. Table partitioning technique comes to the rescue for such cases.

What?

Partitioning a table means that splitting the super set data into multiple subsets (partitions). But, you will not be able to visualize it like how you see your ordinary tables. There is a special query you will have to run to see the list of partitions. So, partitioned table is logically a single table which is composed of multiple physical sub-tables.

When?

Partitions can be beneficial when

  1. your table size is huge and too big to fit in your memory.
  2. you want to backup and restore a certain portion of data than huge dataset
  3. you want to distribute your load physically to other servers. But, partitions are generally maintained inside same database server.

How partitioning works?

As mentioned above, partitioned tables have multiple sub-tables. You can’t see/access the partitions directly. But, your storage engine don’t see your table/partition as two different things instead it only considers partition as a table. When you apply indexes on the partitioned table, it will be implemented as identical indexes over each partition. When you create partitions on a table, you must mention partition type and optionally the number of partitions.

CREATE TABLE partitionedtable (id INT NOT NULL, orderno VARCHAR (15), created_at TIMESTAMP NOT NULL)

PARTITION BY <partition-type>

Let’s see how partitions work on the following query statements.

SELECT query — When you query partitioned table, query optimizer sees if any partition can be ignored for querying and delegates the query to the respective partition which is based on the partition type that we choose.

INSERT query — When you insert a row into the table, partitioning layer determines which partition should receive this row and sends the row to that partition.

UPDATE query — When you update a row in the table, partitioning layer determines which partition contains the row, fetches and modifies it and determines which partition should contain the new row, forwards the row with an insertion request to the destination partition, and forwards the deletion request to the source partition.

DELETE query — When you delete a row from the table, partitioning layer determines which partition contains the row and forwards deletion request to that partition.

Types of Partitions

  • Range based partition

Partition type must be mentioned as RANGE for this type of partition. When a row is inserted, expression YEAR(orderdate) will be evaluated and based on the resultant value, destination partition will be decided. Disadavantage of this partition type is that all range of partitions must be mentioned. Otherwise, you must mention default partition if data doesn’t fall in any range.

CREATE TABLE sales_reports (order_date DATETIME NOT NULL)

PARTITION BY RANGE(YEAR(order_date)) (

PARTITION old_reports VALUES LESS THAN 2016),

PARTITION report_2017 VALUES LESS THAN (2017),

PARTITION report_2018 VALUES LESS THAN (2018),

PARTITION report_2019 VALUES LESS THAN (2019),

PARTITION new_reports VALUES LESS THAN MAXVALUE);

  • Key based partition

Hashing logic will be supplied by MySQL server for Key based partition and partition type must be mentioned as KEY. Any columns used as the partitioning key must comprise part or all of the table’s primary key. When no column is specified for key partition, primary key or unique key will be considered on which partition will be made.

CREATE TABLE sales_reports (id INT NOT NULL PRIMARY KEY)

PARTITION BY KEY();

  • Hash based partition

Another type of partition is Hash-based partitioning. Let’s say you have a table with an autoincrementing id primary key, but you want to partition the data temporally so the recent data is clustered together. You can’t partition by a timestamp column unless you include it in the primary key, but that defeats the purpose of a primary key. You can partition by an expression such as HASH(id DIV 1000000) , which creates a new partition for each million rows inserted. This achieves the goal without requiring you to change the primary key. It has the added benefit that you don’t need to constantly create partitions to hold new ranges of dates, as you’d need to do with range-based partitioning.

CREATE TABLE sales_reports (id INT NOT NULL PRIMARY KEY)

PARTITION BY HASH(id DIV 1000000);

Also, there are other types of partitioning such as

  • LIST partitioning
  • COLUMNS partitioning
  • Subpartitioning

Exploring partitions

To check the working of partition, I took one live sample table whose data size is considerably huge. We have a table called erp_trans_item_formula_dtl which contains the erp formula for each trans_item_id. For each item id, there will be multiple rows of formula details. So, this table will grow huge as and when orders and its items increase. Current size of this table is around 4.5L. I was exploring on stage DB environment on how to partition the tables in MySQL.

I’ve tried a hash-based partition for my analysis as range based partition works on date ranges and due to this reason we might want to add a new partition as and when year/month increments. Hash-based partition works based on the hash value of the specified column and MySQL server determines which partition it should go to by computing the hash value of the column. For testing purpose, I’ve copied the erp_trans_item_formula_dtl table and named it as erp_trans_item_formula_dtl_1 table.

Note: No indexes kept in this table.

Since this table is already created, I’ve created the partitions using the ALTER query as below.

ALTER table erp_trans_item_formula_dtl_1 PARTITION BY HASH(trans_item_id) PARTITIONS 10;

You can’t see the partitions physically as smaller subtables but you can only see its existence logically. Below query tells you how many partitions in a table.

SELECT * FROM information_schema.partitions WHERE TABLE_SCHEMA=’caratlane_16_feb_17' AND TABLE_NAME = ‘erp_trans_item_formula_dtl_1’ AND PARTITION_NAME IS NOT NULL

Your query will not undergo any change to do read/write operations on this partitioned table. As you have seen in the ALTER statement, partitioning is happening based on hash computation of the trans_item_id column. If the WHERE clause of your query contains a column which is used for partition clause, then MySQL server will only scan the specific partition and return the desired result as the indexes will be maintained for each partition. If not, the server will end up scanning all partitions and it consumes more time for scanning the records. (It’s similar to querying huge table without indexes and partitions)

Querying on partition column

explain partitions select * from erp_trans_item_formula_dtl_1 where trans_item_id=15565966;

When I analyze this query using the EXPLAIN statement, you may notice that it scans only partition 6 (p6). Refer to the partitions column as shown in Fig 2.

Querying on non-partition column

explain partitions select * from erp_trans_item_formula_dtl_1 where id=3060;

When you query on a non-partition column, all 10 partitions are scanned which means the entire table. Refer to the partitions column as shown in Fig 3.

Key points

  • You can have 1024 partitions per table.
  • You can’t see the partitions physically as smaller subtables but you can only logically see its existence. Below query tells you how many partitions in a table.
  • Your query must contain a partition column and also your index must include a partition column to yield faster results.
  • Since partitioning is just a logical separation and we are not sharding it into different servers, disk storage will still be occupied as and when the data increases.
  • NULLs can create problem in range-based partitioning when you use YEAR() or TO_DAYS() expression if it evaluates to NULL. So, any range query will check two partitions one for actual date range partition and one for default partition which NULL matches

[embed]https://medium.com/@saivish91[/embed]
We at CaratLane are solving some of the most intriguing challenges to make our mark in the relatively uncharted omnichannel jewellery industry. If you are interested in tackling such obstacles, feel free to drop your updated resume/CV to careers@caratlane.com!
blog-img 2

Discussions

blog-img 3
5 mins
May 17, 2023
Sharing Data Between Controllers: Best Practices S...

This article will help you to understand the diffe

By Naveen C

blog-img 3
5 mins
March 21, 2023
Understanding Auto Layout and Constraints in Swift...

This article gives you an easy way of understandin

By Ramasamy P