[{"data":1,"prerenderedAt":479},["ShallowReactive",2],{"$dwjjwHb0Gq":3},{"id":4,"title":5,"body":6,"description":331,"extension":472,"meta":473,"navigation":474,"path":475,"seo":476,"stem":477,"__hash__":478},"content\u002Fblog\u002Fchunking-notes.md","Chunking Notes",{"type":7,"value":8,"toc":449},"minimark",[9,23,26,31,34,37,58,66,73,76,78,82,147,152,161,202,207,215,224,228,235,239,241,243,247,250,258,264,267,278,281,296,300,302,304,308,311,319,322,332,335,338,341,347,353,356,365,367,369,373,376,380,382,384,388,391,394,397,400,403,411,413,415,419,422,425,428,435,442,445],[10,11,12],"blockquote",{},[13,14,15,16],"p",{},"This is a draft I wrote up containing some general principles for chunking climate model output. It's not really a finished product yet - just some thoughts about how to chunk climate model output in a way that avoids some of the common headaches. If you have any comments\u002Fsuggestions\u002Fcorrections, please let me know - via the contact page on this site, or via an issue on the ",[17,18,22],"a",{"href":19,"rel":20},"https:\u002F\u002Fgithub.com\u002Fcharles-turner-1\u002FNPCP-Chunk-guide",[21],"nofollow","GitHub repository",[24,25],"hr",{},[27,28,30],"h1",{"id":29},"principles-for-chunking-climate-model-output","Principles for chunking climate model output",[13,32,33],{},"This is a draft document aiming to provide some general guidance on deciding chunking for climate model output. There is little intent here to be prescriptive: instead, we aim to provide a set of general principles climate modellers can apply when writing outputs.",[13,35,36],{},"In this document, there are a few overarching principles that we think are important to consider:",[38,39,40,44,52,55],"ul",{},[41,42,43],"li",{},"Climate datasets produced from the output of a model run are generally not contained within a single netCDF file. Instead, they are typically written out as a series of files, often containing the full spatial domain but for a subset of the time domain. 
Files are not often considered to be a chunk in the typical sense, but represent a fundamental unit of storage and chunking. We aim to describe our principles here at a within-file level, but it is important to consider the structure of the files themselves.",[41,45,46,47,51],{},"There is ",[48,49,50],"strong",{},"no such thing as a perfect or optimal chunk scheme",". The optimal chunking scheme for a dataset is fundamentally dependent on the intended analysis to be performed on that dataset. For example, chunking a dataset optimally for producing maps will result in the least optimal chunk scheme for producing time series, and vice versa.",[41,53,54],{},"In this document, we may occasionally refer to 'optimal chunking'. More precisely, what we mean by optimal chunking is the least suboptimal chunking for an unknown use case. That is, we are not trying to optimise chunking for a specific use case, but instead trying to find a chunking scheme that is likely to be reasonably performant for a wide range of use cases.",[41,56,57],{},"Different file formats (e.g. netCDF, HDF, Zarr) have different limitations on how they may chunk data. We will endeavour to provide guidance that is applicable across file formats. However, it is important to note that Zarr and virtualisation technologies require more prescriptive\u002Fless flexible chunking schemes than can be created via the concatenation of netCDF files. We will, as a hard rule, avoid recommending chunking schemes that are incompatible with Zarr and virtualisation. This is necessary to ensure the future proofing of datasets, and to ensure that they can be easily converted to Zarr if desired.",[13,59,60,61,65],{},"In this document, we will, in general, refer to chunking using terminology that is common to analysis workflows. In modern workflows, these are typically written in Python. 
The ",[62,63,64],"code",{},"dask"," library is ubiquitous when working with local chunked datasets, and so we use it as our reference implementation of chunking.",[13,67,68,69,72],{},"As this document is intended for the NPCP, we will also assume users are familiar with the ",[62,70,71],{},"xarray"," data model, which is similarly ubiquitous in modern analysis workflows.",[13,74,75],{},"In the interests of brevity, we will not consider codecs, compression, or anything like that. They are all important considerations too and chunking plays a role in their effectiveness, but we would like this document to end eventually.",[24,77],{},[27,79,81],{"id":80},"key-concepts","Key Concepts",[38,83,84,90,96,102,108,114,120,129,135,141],{},[41,85,86,89],{},[48,87,88],{},"Chunking",": The process of dividing a dataset into smaller, more manageable pieces (chunks) for efficient storage and retrieval.",[41,91,92,95],{},[48,93,94],{},"Header Data",": Metadata that describes the structure and content of the dataset, including information about variables, dimensions, and attributes.",[41,97,98,101],{},[48,99,100],{},"Rectilinear Chunk Grid",": A chunking scheme where the dataset is divided into regular, rectangular chunks along its dimensions. The last chunk along each dimension may be smaller if the total size of the dimension is not perfectly divisible by the chunk size.",[41,103,104,107],{},[48,105,106],{},"Optimal Chunking",": A chunking scheme that is designed to be efficient for a wide range of use cases, even if it is not perfectly optimized for any specific use case. It aims to balance the trade-offs between different access patterns and analysis needs.",[41,109,110,113],{},[48,111,112],{},"Isotropic Chunking",": A chunking scheme where each dimension is composed of an approximately equal number of chunks. 
For example, a grid of size 360x180x50, divided into chunks of size 36x18x5, would be considered isotropic, as each dimension is divided into 10 chunks.",[41,115,116,119],{},[48,117,118],{},"Disk\u002FFile Chunks",": The size of a chunk on disk. Typically these are relatively small - on the order of kilobytes to a few megabytes.",[41,121,122,125,126,128],{},[48,123,124],{},"Dask Chunks",": The size of a dask chunk in memory, when loading a chunked dataset using the ",[62,127,64],{}," library. These are typically larger than disk chunks, and can be on the order of tens to hundreds of megabytes, depending on the use case and available memory.",[41,130,131,134],{},[48,132,133],{},"Rechunking",": The process of changing the chunking scheme of a dataset after it has been created. This can be done to optimize for different access patterns or analysis needs, but can be computationally expensive and may require additional storage space during the rechunking process.",[41,136,137,140],{},[48,138,139],{},"Serialisation",": The process of converting a dataset into a format that can be stored on disk or transmitted over a network. This typically involves writing the dataset to a file format such as netCDF, HDF, or Zarr, which may have specific requirements for chunking and metadata.",[41,142,143,146],{},[48,144,145],{},"Virtualisation",": A family of technologies that index the byte ranges within a file in order to directly access those byte ranges.",[148,149,151],"h2",{"id":150},"chunking-the-30000-foot-view","Chunking - the 30,000 foot view",[13,153,154,155,157,158,160],{},"When opening a chunked dataset with ",[62,156,71],{}," and ",[62,159,64],{},", the opening of a dataset typically looks something like the following:",[162,163,164,169,180,189,192],"ol",{},[41,165,166,168],{},[62,167,71],{}," reads the header data from the file(s) to understand the structure of the dataset, including the variables, dimensions, and attributes. 
This is typically a very fast operation, as the header data is small and can be read quickly.",[41,170,171,173,174,176,177,179],{},[62,172,71],{}," creates a ",[62,175,64],{}," array for each variable in the dataset, using the chunking scheme defined in the file(s). This involves creating a ",[62,178,64],{}," array that is composed of many smaller chunks, each of which corresponds to an integer number of chunks on disk along each dimension.",[41,181,182,183,185,186,188],{},"If multiple files are opened, xarray concatenates the ",[62,184,64],{}," arrays from each file along the appropriate dimension(s) to create a single ",[62,187,64],{}," array for each variable that represents the entire dataset.",[41,190,191],{},"The user specifies an analysis to be performed on the dataset. In the background, xarray and dask build a 'task graph', a graph which describes the series of computations that must be performed in order to compute the analysis the user has described.",[41,193,194,195,198,199,201],{},"When the user calls ",[62,196,197],{},".compute()"," on the result of their analysis, ",[62,200,64],{}," executes the task graph, which involves reading the necessary chunks from disk, performing the necessary computations in memory, and creating the final result.",[203,204,206],"h3",{"id":205},"executing-the-task-graph","Executing the task graph",[13,208,209,210,214],{},"Let's first consider computation on a single dask chunk. Fundamentally, each dask chunk corresponds to a numpy array. The whole dask chunk is realised into memory as a single array, and so the entire numpy array corresponding to that chunk must, at the very least, fit into the total available system memory. This provides ",[211,212,213],"em",{},"the most conservative upper bound"," on the size of a dask chunk. Put simply, a laptop with 8GB of RAM cannot process a dask chunk that is larger than 8GB. 
In practice, the maximum size of a dask chunk is likely to be much smaller than the total available memory, as the system needs to allocate memory for other processes, and the analysis being performed may require multiple dask chunks to be loaded into memory at the same time.",[10,216,217],{},[13,218,219,223],{},[220,221,222],"span",{},"!TIP","\nThe example choice of a laptop with 8GB of RAM here is not purely coincidental. The optimal dask chunk choice may be much larger for e.g. a megamem ARE session with 192GB of RAM, than on a personal laptop. As dask chunks should be an integer multiple of disk chunks, it is important that disk chunks are small enough that they can comfortably fit into memory on all reasonably anticipated hardware.",[203,225,227],{"id":226},"combining-chunks","Combining Chunks",[13,229,230,231,234],{},"In order to perform an analysis, dask must build a ",[48,232,233],{},"task graph",", which it then uses to combine the results of each dask chunk computation into the final result. Each node in the task graph represents an operation on a single dask chunk. Each edge in the graph represents a dependency between operations on two dask chunks. The smaller the dask chunks, the more nodes and edges there will be in the task graph, and the more overhead there will be in executing the task graph.",[203,236,238],{"id":237},"principle-larger-chunks-result-in-a-smaller-task-graph-and-therefore-less-dask-overhead-in-executing-the-graph-this-comes-at-the-expense-of-increased-memory-pressure","Principle: Larger chunks result in a smaller task graph, and therefore less dask overhead in executing the graph. 
This comes at the expense of increased memory pressure.",[24,240],{},[24,242],{},[148,244,246],{"id":245},"a-disk-chunk-is-the-quanta-of-storage-on-disk-and-a-dask-chunk-is-the-quanta-of-computation-in-memory","A disk chunk is the quantum of storage on disk and a dask chunk is the quantum of computation in memory.",[13,248,249],{},"Let's assume we have a chunked dataset, with a couple of simplifying assumptions:",[38,251,252,255],{},[41,253,254],{},"Comprising a single variable",[41,256,257],{},"We intend to use the same dask and disk chunks.",[13,259,260,261,263],{},"When ",[62,262,71],{}," opens this file, it creates a dask array for the variable, with one dask chunk for each disk chunk. This means that for each disk chunk, there will be a corresponding in-memory numpy array.",[13,265,266],{},"Imagine now that we want to open the 'first' chunk of the dataset, and select a subset of it. This whole operation occurs in three parts:",[162,268,269,272,275],{},[41,270,271],{},"Open the chunked dataset, and read the metadata in order to determine the chunking scheme.",[41,273,274],{},"Read the entirety of the first chunk from disk into memory.",[41,276,277],{},"Discard the parts of the chunk which are irrelevant to our selection.",[13,279,280],{},"From this, we can infer a couple of important principles:",[162,282,283,286,289],{},[41,284,285],{},"Ideally, we do not want our disk chunks to be much larger than the typical size of a selection that a user might make. 
If they are, then users will be forced to read large chunks of data into memory, only to discard most of it.",[41,287,288],{},"If we want to support efficient selection of small subsets of the data, we need to ensure that our disk chunks are small enough to allow for this.",[41,290,291,292,295],{},"If our dask chunks are not well aligned with our disk chunks - for example, a dask chunk spans only half a disk chunk - then we will be forced to read and decode the disk chunk twice in order to write it into two separate dask chunks. IO is typically ",[211,293,294],{},"the largest bottleneck"," in analysis workflows, and so this is a situation we want to avoid.",[203,297,299],{"id":298},"principle-disk-chunks-should-be-small-enough-to-allow-for-efficient-selection-of-small-subsets-of-the-data-and-dask-chunks-should-be-an-integer-multiple-of-disk-chunks-in-order-to-avoid-unnecessary-io-overhead","Principle: Disk chunks should be small enough to allow for efficient selection of small subsets of the data, and dask chunks should be an integer multiple of disk chunks in order to avoid unnecessary IO overhead.",[24,301],{},[24,303],{},[148,305,307],{"id":306},"rechunking-it-is-cheaper-to-combine-than-split-chunks","Rechunking: It is cheaper to combine than split chunks",[13,309,310],{},"Consider two 'orthogonal' analysis workflows: one which generates a map, and one which generates a time series.",[38,312,313,316],{},[41,314,315],{},"The map workflow is optimised by chunking the data along the time dimension, so that each chunk contains the full spatial domain of the data, but only a single slice of time.",[41,317,318],{},"The time series workflow is optimised by chunking the data along the spatial dimensions, so that each chunk contains a single point in space, but the full time domain of the data.",[13,320,321],{},"This can be visualised as a 'pancakes to churros' or 'burgers to hotdogs' 
scenario:",[323,324,329],"pre",{"className":325,"code":327,"language":328},[326],"language-text"," +-------------------+      +-------------------+\n |                   |      |    |    |    |    |\n +-------------------+      +    |    |    |    +\n |                   |      |    |    |    |    |\n +-------------------+      +    |    |    |    +\n |                   |  ->  |    |    |    |    |\n +-------------------+      +    |    |    |    +\n |                   |      |    |    |    |    |\n +-------------------+      +    |    |    |    +\n |                   |      |    |    |    |    |\n +-------------------+      +-------------------+\n","text",[62,330,327],{"__ignoreMap":331},"",[13,333,334],{},"As a concrete example of this, consider a workflow where we wish to produce timeseries optimised chunks from map optimised chunks - as illustrated above.",[13,336,337],{},"In such a scenario, we would need to either:\na. Read the entire dataset into a single dask chunk (numpy array) in memory, and then split it into smaller chunks.\nb. Read each chunk repeatedly from disk, and write it into the appropriate dask chunks on disk. For example, to produce the first churro, we might need to read the first pancake, write the first 10% of it into the first churro, then read the second pancake, write the first 10% of it into the first churro, and so on until we have read all pancakes and written the first churro. We would then repeat this process for each subsequent churro.",[13,339,340],{},"Now consider an isotropically chunked dataset. We can produce either pancakes, or churros, purely by combining chunks: it is not necessary to split any chunks, nor 'overread' any chunks. 
This is illustrated in the following diagrams:",[323,342,345],{"className":343,"code":344,"language":328},[326],"+---+---+---+---+---+        +-------------------+\n|   |   |   |   |   |        |    |    |    |    |\n+---+---+---+---+---+        +    |    |    |    +\n|   |   |   |   |   |        |    |    |    |    |\n+---+---+---+---+---+        +    |    |    |    +\n|   |   |   |   |   |    ->  |    |    |    |    |\n+---+---+---+---+---+        +    |    |    |    +\n|   |   |   |   |   |        |    |    |    |    |\n+---+---+---+---+---+        +    |    |    |    +\n|   |   |   |   |   |        |    |    |    |    |\n+---+---+---+---+---+        +-------------------+\n",[62,346,344],{"__ignoreMap":331},[323,348,351],{"className":349,"code":350,"language":328},[326],"+---+---+---+---+---+         +-------------------+\n|   |   |   |   |   |         |                   |\n+---+---+---+---+---+         +-------------------+\n|   |   |   |   |   |         |                   |\n+---+---+---+---+---+         +-------------------+\n|   |   |   |   |   |    ->   |                   |\n+---+---+---+---+---+         +-------------------+\n|   |   |   |   |   |         |                   |\n+---+---+---+---+---+         +-------------------+\n|   |   |   |   |   |         |                   |\n+---+---+---+---+---+         +-------------------+\n",[62,352,350],{"__ignoreMap":331},[13,354,355],{},"In this sense, although the isotropically chunked dataset is suboptimal for both workflows, it is the least suboptimal for both workflows, and therefore represents the optimal chunking scheme for an unknown use case.",[13,357,358],{},[211,359,360,361,364],{},"Question: It is common to produce either maps ",[48,362,363],"or"," timeseries. Combination plots such as Hovmöller plots are less common. Therefore, it stands to reason that our notion of isotropic chunks ought to consider latitude and longitude to be somewhat entangled, and weighted together somehow. 
A mathematically pure notion of this currently escapes me, but I think it should involve square roots somehow.",[24,366],{},[24,368],{},[148,370,372],{"id":371},"why-not-tiny-disk-chunks","Why not tiny disk chunks?",[13,374,375],{},"As with dask chunks, smaller disk chunks require additional coordination overhead. This also happens unavoidably at an IO\u002Ffilesystem level. Each chunk is compressed individually, and so one decompression operation is required per chunk. Smaller chunks therefore require, amongst other things, more decompression operations, increasing overhead.",[203,377,379],{"id":378},"principle-the-best-chunking-for-a-given-operation-are-the-largest-chunks-that-do-not-cause-us-to-over-read-from-disk-or-exceed-our-memory-constraints-to-accomodate-for-multiple-possible-use-cases-we-should-aim-to-produce-the-largest-chunks-that-are-still-small-enough-to-allow-for-efficient-selection-of-small-subsets-of-the-data-and-that-can-comfortably-fit-into-memory-on-reasonably-anticipated-hardware","Principle: The best chunks for a given operation are the largest chunks that do not cause us to over-read from disk, or exceed our memory constraints. To accommodate multiple possible use cases, we should aim to produce the largest chunks that are still small enough to allow for efficient selection of small subsets of the data, and that can comfortably fit into memory on reasonably anticipated hardware.",[24,381],{},[24,383],{},[148,385,387],{"id":386},"considering-the-whole-dataset-chunks-and-files","Considering the whole dataset: Chunks and Files",[13,389,390],{},"So far, we have only discussed chunking at a within-file level. However, as we have already noted, climate datasets are typically not contained within a single file, but are instead written out as a series of files. 
Each file typically contains the full spatial domain of the data, but only a subset of the time domain.",[13,392,393],{},"Crucially, we can think of files themselves as chunks.",[13,395,396],{},"When choosing isotropic chunking schemes, this becomes important. Climate models are numerical integrations, and so their output is typically written out in terms of time slices - one month of data per file, for example.",[13,398,399],{},"If we then try to create a chunking scheme that is isotropic at a within-file level, we may end up with a chunking scheme that is highly anisotropic at the whole dataset level. For example, if we have a dataset that is 360x180x50 (lon x lat x time), and we write it out as 5 files of 360x180x10, then we might choose to chunk each file into chunks of size 36x18x1. This would be isotropic at the within-file level, with ten chunks per dimension, but at the whole dataset level, we would have ten chunks per spatial dimension, but 50 in time.",[13,401,402],{},"This seems highly anisotropic - we have many more chunks in time than in space. However, as we have 5 files, there is no way to have fewer than 5 chunks in time. Therefore, the 'best' we can do is to have 5 chunks in time, and 10 chunks in each spatial dimension. This is still reasonably isotropic, and is likely to be the optimal chunking scheme for an unknown use case.",[203,404,406,407,410],{"id":405},"principle-files-are-chunks-and-cannot-be-ignored-chunking-schemes-must-take-these-file-chunks-into-account-as-they-are-less-mutable-than-disk-chunk-and-so-are-a-stronger-constraint-on-the-optimal-chunking-scheme","Principle: Files ",[211,408,409],{},"are"," chunks, and cannot be ignored. 
Chunking schemes must take these 'file chunks' into account, as they are less mutable than disk chunks, and so are a stronger constraint on the optimal chunking scheme.",[24,412],{},[24,414],{},[148,416,418],{"id":417},"future-proofing-avoiding-zarr-incompatible-chunking-schemes-for-virtualisation","Future Proofing: Avoiding Zarr-incompatible chunking schemes for virtualisation",[13,420,421],{},"Historically, climate model output has typically been written as netCDF. However, netCDF fares extremely poorly on cloud storage, due to assumptions which only hold on local filesystem storage.",[13,423,424],{},"Zarr is a modern, cloud optimised data format, which takes the notion of a dataset being comprised of multiple files, and extends that to the chunk level. A zarr store is a hierarchical directory tree, with separated metadata and a file for each chunk (or a group of chunks, known as sharding). However, this file format has historically fared poorly on HPC systems, as it creates large numbers of inodes unless sharding (unavailable prior to zarr v3) is used.",[13,426,427],{},"In the zarr data model, a large, multi-file netCDF dataset is represented by a single zarr store. The developers of zarr, noting that copying archival, multi PB datasets to zarr is prohibitively expensive, developed a set of technologies known as virtualisation. Virtualisation takes a group of netCDF files, and creates a zarr store which indexes byte ranges within those files in order to directly access individual chunks.",[13,429,430,431,434],{},"This has a wide range of performance benefits and enables the use of Zarr's cloud optimised features and burgeoning ecosystem with netCDF datasets. However, it requires that the dataset being virtualised respects the zarr data model. In particular, for a multi file dataset, it requires that the chunking scheme for the combined dataset can be represented as a ",[211,432,433],{},"rectilinear chunk grid",". 
This can require particular care when choosing chunking schemes for multi-file datasets, as it is easy to end up with a chunking scheme that is incompatible with virtualisation, and therefore cannot be easily converted to virtual zarr in the future.",[13,436,437,438,441],{},"As a simple example, consider the following: daily data, written at monthly frequency. This ",[211,439,440],{},"cannot"," be virtualised, as the chunking scheme in time will be (31, 28, 31, 30 ...) for the different files, and so cannot be represented as a rectilinear chunk grid.",[13,443,444],{},"For a calendar without leap years, this can easily be solved by writing out data at either a daily, 73 day, or yearly frequency: these all produce rectilinear chunk grids. For a calendar with leap years, the implementation is more complex, but the principles are the same.",[203,446,448],{"id":447},"principle-combining-file-chunks-in-a-way-which-produces-variable-chunk-lengths-prohibits-future-virtualisation-avoid-wherever-possible","Principle: Combining file chunks in a way which produces variable chunk lengths prohibits future virtualisation. Avoid wherever possible.",{"title":331,"searchDepth":450,"depth":450,"links":451},2,[452,458,461,462,465,469],{"id":150,"depth":450,"text":151,"children":453},[454,456,457],{"id":205,"depth":455,"text":206},3,{"id":226,"depth":455,"text":227},{"id":237,"depth":455,"text":238},{"id":245,"depth":450,"text":246,"children":459},[460],{"id":298,"depth":455,"text":299},{"id":306,"depth":450,"text":307},{"id":371,"depth":450,"text":372,"children":463},[464],{"id":378,"depth":455,"text":379},{"id":386,"depth":450,"text":387,"children":466},[467],{"id":405,"depth":455,"text":468},"Principle: Files are chunks, and cannot be ignored. 
Chunking schemes must take these 'file chunks' into account, as they are less mutable than disk chunks, and so are a stronger constraint on the optimal chunking scheme.",{"id":417,"depth":450,"text":418,"children":470},[471],{"id":447,"depth":455,"text":448},"md",{},true,"\u002Fblog\u002Fchunking-notes",{"description":331},"blog\u002Fchunking-notes","kazJVOJjbf0A9bX1JTl1KtOLhslDwuhwgbcxlN2SRnM",1780312513230]