I have been exploring how disk-oriented databases efficiently move data in and out of disk. One way, that I explored in Discovering and exploring mmap using Go and But how, exactly, databases use mmap?, is through memory-mapped files. Although mmap is a really neat solution, it has some troubles. Most troubles come from the fact that the database has no control of how pages are flushed to disk since that job is carried through the OS. Because of that most databases avoid using mmap. However, they still need to read and write data from disk in an efficient manner. And the answer to that is: buffer pool.

In this post, we'll explain how a buffer pool manager works and how to implement one in Go. But first, let's do a simple recap of how data is structured in disk.

How a database file is structured?

The file in disk that stores data of a database is just, in most databases, an array of bytes organized into fixed-length blocks called pages. A database file is a linear array of continuous pages.

A database file is organized into fixed-length blocks called pages

The page length is usually a multiple of the filesystem block length ranging from 512 bytes to 16KB. PostgreSQL, for example, uses 8KB, although a different page size can be selected when compiling the server[1]. InnoDB, a storage engine used in MySQL, uses 16KB for page size[2].

The layout of these pages is not the scope of this post. However, just to give a glimpse of what happens when writes are executed, think that, for row-oriented databases, tuples are appended inside the page until it is filled up. When the page is filled up, another page needs to be allocated to support incoming writes.

The buffer pool manager

So we have our data in disk and the database wants to access it, let's say, through a SELECT command. Or we have new incoming data, that we want to store. In order to manage reads and writes efficiently, the common practice is to avoid touching the disk directly. The execution engine that is processing the commands has no access to disk. It has access to an in-memory intermediate layer abstraction called buffer pool.

High level architecture of how data is accessed by the execution engine

The execution engine requests pages from the buffer pool by providing a page identifier. It can read the content of the page and send the data to the client. Or it can modify the content of the page and send it back to the buffer pool so that the change can be persisted. The execution engine access the buffer pool thinking that it is accessing the disk.

I really think the term buffer pool is kind of misleading here, and it confused me a lot. And that's because the purpose of this structure is not only to manage a pool of empty buffers and its reuse but also it is responsible for caching the disk pages into memory for future reuse (not only by the requesting thread but also for concurrent threads). And as with any cache, also responsible for caching replacement policy. For this reason, a more appropriate name for the buffer pool is page cache. With this in mind, you can imagine the importance of the buffer pool for speeding up the database access to disk.

The buffer pool can be implemented with a statically allocated array of frames. Frames start empty and fill up as pages are being requested. There is an associative structure called page table, that indicates which page is being held by each frame.

A representation of a full buffer pool

The above image indicates the two data structures used for implementing a buffer pool: a frame array and a page table. The image represents a buffer pool that can cache a maximum of 4 pages. The pages being cached are 4, 5, 6, and 7. The arrows indicate where the page can be found in memory (in the array of frames). Let's look into another buffer pool:

A representation of a buffer pool with one empty frame

At this, we can see that there are 3 pages in cache, and one space left. The frames that are free and are ready for receiving a new page are tracked in an additional structure called free list.

The pages in red indicate the pages that are being accessed by some thread. That indication is very important because it means that the page cannot leave the buffer pool until all accesses are finished. We will be back to this.

Another important concept is page replacement. What happens when the cache is full, and a page that is not in the cache is requested? We need to evict a page that is not being accessed and give its frame to the requested page. There are many algorithms (or policies) to decide which page to evict. We'll talk about this later.

With these concepts in mind, here is how we can represent our buffer pool in Go code

//BufferPoolManager represents the buffer pool manager
type BufferPoolManager struct {
	diskManager DiskManager
	pages       [MaxPoolSize]*Page
	replacer    *ClockReplacer
	freeList    []FrameID
	pageTable   map[PageID]FrameID
}

//NewBufferPoolManager returns a empty buffer pool manager
func NewBufferPoolManager(DiskManager DiskManager, clockReplacer *ClockReplacer) *BufferPoolManager {
	freeList := make([]FrameID, 0)
	pages := [MaxPoolSize]*Page{}
	for i := 0; i < MaxPoolSize; i++ {
		freeList = append(freeList, FrameID(i))
		pages[FrameID(i)] = nil
	}
	return &BufferPoolManager{DiskManager, pages, clockReplacer, freeList, make(map[PageID]FrameID)}
}

The buffer pool receives a DiskManager implementation and a replacer. In our case, we are going to implement the clock replacement policy, implemented by ClockReplacer, for page eviction. Internally, pages represents the frame array of fixed size (MaxPoolSize); pageTable is a map the indicates which frame holds which page; and freeList tracks the array of free frames. Initially, we have all frames ids inside the freeList, because all frames are free. That also means all frames pointing to nil.

And here is the API provided by our buffer pool implementation:

NewPage() *Page
FetchPage(pageID PageID) *Page
FlushPage(pageID PageID) bool
FlushAllPages()
DeletePage(pageID PageID) error
UnpinPage(pageID PageID, isDirty bool) error
  • NewPage() *Page: a clean new page can be requested. The buffer pool will ask the disk manager for a new page id;
  • FetchPage(pageID PageID) *Page: fetches a known page. If the page is in cache, it is immidiatatly returned. If not, the buffer pool will find it on disk and cache it;
  • FlushPage(pageID PageID) bool: writes the page that is in cache to disk;
  • FlushAllPages() writes all pages in cache to disk;
  • DeletePage(pageID PageID) error: deletes page from cache and disk;
  • UnpinPage(pageID PageID, isDirty bool) error: indicates that the page is not used any more for the current requesting thread. If no more threads are using this page, the page is considered for eviction.

How to represent a page

We are going to represent our page in a very simple manner. Our focus here is to learn how the buffer pool works. Our page is just a []byte followed by an id. There is some metadata that needs to be tracked. One is the pinCount that tracks the number of concurrent accesses. Each time the page is accessed the pinCount is increased. When it is no more in use, the pinCount is decreased. So, if the pinCount is greater than 0, we say the page is pinned and it is indicated in our figures by the red mark. The other metadata variable is isDirty, which indicates that the page has been modified after it was read from disk.

// PageID is the type of the page identifier
type PageID int

const pageSize = 5

// Page represents a page on disk
type Page struct {
	id       PageID
	pinCount int
	isDirty  bool
	data     [pageSize]byte
}

The Clock Replacement algorithm

When the buffer pool is full, and a new page needs to enter the cache, some page needs to be chosen to leave the cache.  This is a classic cache eviction problem and there is no perfect solution because predicting which pages are going to be requested is a complicated problem. If you can't predict which pages are going to be requested, you don't know which pages must remain in cache and which pages must be evicted. The best you can do is to reduce the number of wrong page evictions. There are many strategies like FIFO (first-in first-out), LRU (least recently used), and LFU (least frequently used). We are not going to explore all of them and their trade-offs. For our buffer pool manager implementation, we have chosen one policy called CLOCK.

When a thread from the execution engine no longer needs a page, it calls the UnpinPage method from the buffer pool. This method reduces the pinCount of the page, and if the pin count reaches zero, the frame id of where that page allocated is stored in the Clock Replacer for future analysis. If, anytime, the same page is accessed again (its pin count increases), and it needs to be removed from the Clock Replacer.

The Clock Replacer keeps tracks of all pages that are not being accessed in a circular buffer. For each page, it also keeps a reference bit to indicate if the page was recently accessed. There is a Clock Hand or Clock Pointer (CP) that iterates through the circular buffer looking for a victim.

Clock Replacer representation 

The Clock Replacer provides the following API:

func Victim() *FrameID
func Unpin(id FrameID)
func Pin(id FrameID) 
  • Unpin is called when the buffer pool wants to make a page eligible for eviction because it is no longer accessed;
  • Pin is called when the buffer pool wants to remove a page from cache eviction election because it was accessed and it cannot be evicted;
  • Victim elects a victim through the clock replacement policy.

Let's see how the clock replacement policy works using an example.

The state of the Clock Replacer after the calls Unpin(1), Unpin(2) and Unpin(3).

num ref clock hand
1 1 <-
2 1
3 1

We have 3 frames eligible for eviction. When Victim() is called, the clock replacement algorithm is run. The clock hand iterates through the circular buffer checking the reference bit. If the bit is set to 1 (that means that the frame was recently added to the clock replacer), it changes the bit to 0. Bellow is how the clock replacer state changes when Victim() is run until it finds a victim for eviction.

num ref clock hand
1 0
2 1 <-
3 1
num ref clock hand
1 0
2 0
3 1 <-
num ref clock hand
1 0 <-
2 0
3 0
num ref clock hand
2 0 <-
3 0

If we call Unpin(4), we have

num ref clock hand
2 0 <-
3 0
4 1

And after a call to Victim(), 2 is evicted.

num ref clock hand
3 0 <-
4 1

The clock replacement policy is a very simple implementation that approximates the LRU (least recently used) policy, without the need to need track the timestamp of the last time page was accessed. It can be easily implemented using a circular list.

Using a doubly-linked circular list to implement Clock Replacement

You can check the full implementation at clock_replacer.go.

The Disk Manager

The disk manager is responsible for interacting with the disk. It should be able to read a page from disk, write a page to disk, allocate a new page, and remove a page.

// DiskManager is responsible for interacting with disk
type DiskManager interface {
	ReadPage(PageID) (*Page, error)
	WritePage(*Page) error
	AllocatePage() *PageID
	DeallocatePage(PageID)
}

Since our main concern was the buffer pool, in our demo we implemented an in-memory mock of the disk manager, that can be checked at disk_manager_mock.go.

Implementing the Buffer Pool

Let's dissect the FetchPage, NewPage and ÙnpinPage methods. The other methods are pretty straightforward, and you can check full code at the repo.

Fetching a page

First, we check if the page is in the buffer pool. If we can find the page in the cache, we increase the pin count (to indicate there is one more thread accessing the page) and make sure it not in the clock replacer by calling Pin.

if frameID, ok := b.pageTable[pageID]; ok {
    page := b.pages[frameID]
    page.pinCount++
    (*b.replacer).Pin(frameID)
    return page
}

If it is not in the buffer pool we need to find a frame to allocate this page. The buffer pool can have free frames or it can be full. The getFrameID method handles this. We check our freeList to see if there are any frames available. If we can find an empty frame, we return its id and remove it from freeList. If we can't find an empty frame, we have to elect a frame from the buffer pool to free space by running the clock replace algorithm. The bool indicates if the frame id came from the free list or from the eviction algorithm.

func (b *BufferPoolManager) getFrameID() (*FrameID, bool) {
	if len(b.freeList) > 0 {
		frameID, newFreeList := b.freeList[0], b.freeList[1:]
		b.freeList = newFreeList

		return &frameID, true
	}

	return (*b.replacer).Victim(), false
}

Now that we have a frame id and it didn't come from freeList, we have to remove the page that is allocated in that frame. But if the page was modified (isDirty set to true), then we have to first write it to disk.

if !isFromFreeList {
    // remove page from current frame
    currentPage := b.pages[*frameID]
    if currentPage != nil {
        if currentPage.isDirty {
            b.diskManager.WritePage(currentPage)
        }

        delete(b.pageTable, currentPage.id)
    }
}

Finally, we read the page from disk and allocate it to the chosen frame id

page, err := b.diskManager.ReadPage(pageID)
if err != nil {
    return nil
}
(*page).pinCount = 1
b.pageTable[pageID] = *frameID
b.pages[*frameID] = page

Requesting a new page

Requesting a new page follows the same logic as fetching a page. We need to make sure there is space for a new page in the buffer pool by checking the free list or by replacing a page. After that, the buffer pool requests the disk manager a new page and put it in the cache.

// allocates new page
pageID := b.diskManager.AllocatePage()
if pageID == nil {
    return nil
}
page := &Page{*pageID, 1, false, [pageSize]byte{}}

b.pageTable[*pageID] = *frameID
b.pages[*frameID] = page

Indicating that a page is no longer accessed

After some work is done by a thread on a page, it needs to call the UnpinPage method to indicate the buffer pool that it not using that page anymore. This call is important because it lets the buffer pool decide if that page can be considered for eviction.

The method decreases the pin count of the page. If the pin count reaches 0, it means that there is no thread accessing that page. So, that page is added to the clock replacer to be considered for eviction.

page := b.pages[frameID]
page.DecPinCount()

if page.pinCount <= 0 {
    (*b.replacer).Unpin(frameID)
}

A visualization tool

I was not satisfied with only implementing the buffer pool. I wanted to able to visualize it working. So I went on a mission to build a small JavaScript application that could interact with the buffer pool. I built a web server that instantiated a buffer pool and could receive HTTP requests and send commands to the buffer pool.

Here is a small demo:

Buffer pool demonstration

You can see four initial calls of New Page, loading all frames of our buffer pool. We flush all the pages to disk. Flushing the pages to disk reduces the pin count of all pages by one (that's why they go from red to black). After that, Unpin is called to pages 2 and 3, adding both respective frames to the clock replacer. Delete is called for page 2, removing it from the clock replacer, the buffer pool, and the disk, causing a disk fragmentation. New page is called, and page 4 enters the buffer pool in the free frame 2. New page is called again, but the buffer is full. The clock replacement algorithm runs and frees frame 3 for page 5, evicting page 3. Flush all is called writing pages 4 and 5 to disk. Fetch Page is called on page 0 twice, increasing the pin count by two. That means that the Unpin needs to be called twice, in order to make it eligible for eviction. Last, New Page is called, but the buffer is full again. After the clock replacement algorithm is run, page 6 goes to frame 0.

You can play yourself by following the instructions at the repo.

Final thoughts

I have been studying databases for awhile mainly through the course Intro to Database Systems of Andy Pavlo. In the course, he talks about how mmap is most of the time a bad idea for solving the "data in disk larger than the available memory" problem and introduces the concept of buffer pool. One of the assignments is to implement a buffer pool manager in C++. That's where the idea of implementing one in Go came from.

Even though mmap and buffer pool are two completely different implementations to solve the same problem, I have come to found both very similar. The buffer pool acts like it has all pages in memory, and if the page is not in the cache, it removes one from the cache and fetches the page from the disk putting it on the cache. Pretty much the same logic of the virtual memory that is behind the mmap call. The difference in the buffer pool approach is that everything is in control. The way pages are flushed, locked, and evicted from cache is controlled by the buffer pool. And this kind of control, you completely lose with mmap. And that is why is not to be recommended unless you really know what you are doing.

A buffer pool is supposed to be a shared resource and it is meant to be used by multiples threads doing queries at the same time. In this implementation, it was not our focus to build a thread-safe buffer pool protecting the data structures from race conditions. A thread-safe implementation is planned.

I am on a journey to learn more about databases by implementing parts of one. If you like this kind of content, follow me on Twitter. I will be sharing more related stuff.

Full code at brunocalza/buffer-pool-manager.

Learn more

References

[1] 68.6. Database Page Layout
[2] 14.8.1 InnoDB Startup Configuration