MemoryMap.h
/*
 * Copyright (C) 2010 Regents of the University of Michigan
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

#ifndef __MEMORYMAP_H
#define __MEMORYMAP_H
#include <sys/types.h>
#include <fcntl.h>

#if defined(_WIN32)
#include <windows.h>
#endif

///
/// There are a pair of related data structures in the operating system,
/// and also a few simple algorithms, that explain why your processes are
/// waiting forever.
///
/// The symptom you see is that your processes are getting little or no
/// CPU time, as shown by the command 'top'. The machine will appear to
/// have available CPU time (look at the Cpu(s): parameter - if it is
/// less than 100%, you have available CPU). The real key, however, is to
/// look at the 'top' column with the label 'S' - that is the status of
/// the process, and it is crucial to understanding what is going on.
///
/// In your instance, the 'S' column for your karma jobs is 'D', which
/// means the process is waiting for data. This is because the process is
/// doing something that is waiting for the filesystem to return data to
/// it. Usually, this is because of a C call like read() or write(), but
/// it also happens in large processes where memory was copied to disk
/// and re-used for other purposes (this is called paging).
///
/// So, a bit of background on the operating system... there is a CPU
/// scheduler that takes a list of waiting processes and picks one to
/// run - if a job is waiting for the disk, there is no point in picking
/// it to run, since it is blocked, waiting for the disk to return data.
/// The scheduler marks the process with 'D' and moves on to the next
/// process to schedule.
///
/// There are two data structures that we care about for this example.
/// The first is a linear list of disk buffers that are stored in RAM and
/// controlled by the operating system. This is usually called the disk
/// buffer pool. When a program asks for data from the disk, this list
/// can be scanned quickly to see if the data is already in RAM - if so,
/// no disk operation needs to take place.
///
/// Now in the case of the normal Unix read() and write() calls, when the
/// operating system has found the page, it copies the data into a buffer
/// to be used by the process that requested it (in the case of a read()
/// - a write() is the opposite). This copy operation is slow and
/// inefficient, but gets the job done.
///
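/// As an illustration only (not code from this library), the copy
/// semantics of read() look roughly like this; the file name is
/// hypothetical and error handling is omitted:
///
/// \code
/// #include <fcntl.h>
/// #include <unistd.h>
///
/// char buffer[4096];
/// int fd = open("example.dat", O_RDONLY);          // hypothetical file
/// // read() copies bytes out of the kernel's disk buffer pool into
/// // 'buffer' - one extra copy on every call:
/// ssize_t count = read(fd, buffer, sizeof(buffer));
/// close(fd);
/// \endcode
///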
/// So overall, you gain some efficiency in a large memory system by
/// having this disk buffer pool data structure, since you aren't
/// re-reading the disk over and over to get the same data that you
/// already have in RAM. However, it is less efficient than it might be
/// because of the extra buffer copying.
///
/// Now we come to memory mapped files, and karma. The underlying system
/// call of interest to us is mmap(), and it is used in MemoryMap.cpp.
/// What it does and how it works are important to understanding its
/// benefits, and frankly, most people don't care about it because it is
/// seemingly complex.
///
/// Two things are important to know: firstly, there is a data structure
/// in the CPU called the page table, which is mostly contained in the
/// CPU hardware itself. All memory accesses for normal user processes
/// like karma go through this hardware page table. Secondly, it is very
/// fast for the operating system to put together a page table that
/// 'connects' a set of memory locations in your user program's address
/// space to the disk buffer pool pages.
///
/// The combination of those two facts means that you can implement a
/// 'zero copy' approach to reading data: the data that is in the disk
/// buffer pool is directly readable by the program without the operating
/// system ever having to actually copy the data, as it does for read()
/// or write().
///
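/// A minimal sketch of that zero-copy path, calling mmap() directly (the
/// file name is hypothetical and error checking is omitted; MemoryMap.cpp
/// wraps these calls for you):
///
/// \code
/// #include <sys/mman.h>
/// #include <sys/stat.h>
/// #include <fcntl.h>
/// #include <unistd.h>
///
/// int fd = open("example.dat", O_RDONLY);          // hypothetical file
/// struct stat st;
/// fstat(fd, &st);
/// // the kernel builds page table entries that point at its own disk
/// // buffers - nothing is copied into a user buffer:
/// void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
/// char first = ((char *) base)[0];   // read straight from the mapping
/// munmap(base, st.st_size);
/// close(fd);
/// \endcode
///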
/// So the benefit of mmap() is that when the underlying disk pages are
/// already in the disk buffer pool, a hardware data structure gets
/// built, the call returns, and the data is available at full processor
/// speed with no intervening copy of the data, no waiting for the disk,
/// and nothing else in the way. It is as near to instantaneous as you
/// can possibly get. This works whether it is 100 bytes or 100
/// gigabytes.
///
/// So, the last part of the puzzle is why your program winds up in 'D'
/// (data wait), and what to do about it.
///
/// The disk buffer pool is a linear list of blocks ordered by the time
/// and date of access. A process runs every once in a while to take the
/// oldest of those pages and free them, during which it also has to
/// update the hardware page tables of any processes referencing them.
///
/// So on wonderland, most file access (wget, copy, md5sum, anything
/// else) is constantly putting fresh pages at the front of the list, and
/// karma index files, having been opened a while ago, are prime
/// candidates for being paged out. The reason they get paged out, as far
/// as I know, is that in any given second of execution nowhere near the
/// entire index is getting accessed... so at some point, at least one
/// page gets sent back to disk (well, flushed from RAM). Once that
/// happens, a cascading effect sets in: the longer the job waits, the
/// older the other pages get, the more of them get reclaimed, and the
/// slower it gets, until karma is at a standstill, waiting for pages to
/// be brought back into RAM.
///
/// Now in an ideal world, karma would rapidly recover, and it can...
/// sometimes. The problem is that your karma job is accessing data all
/// over that index, so to the underlying filesystem it looks essentially
/// like pure random I/O. There is about a 10 to 1 performance difference
/// between accessing the disk sequentially and accessing it randomly.
///
/// So to make karma work better, the first thing I do when starting
/// karma is to force it to read all of the disk pages in order. This
/// forces the entire index into memory in order - sequential reads,
/// which is the best case possible. There are problems: for example, if
/// three karma jobs start at once, the disk I/O is no longer as purely
/// sequential as we would like. Also, if the filesystem is busy taking
/// care of other programs, even if karma thinks it is forcing sequential
/// I/O, the net result looks more random. This happens when the system
/// is starting to break down (thrashing), and it will certainly stall,
/// look very slow, or crash.
///
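/// That forced sequential read can be sketched as touching one byte per
/// page, in file order - conceptually what the prefetch() method below
/// is for (the page size and the names here are illustrative only):
///
/// \code
/// // touch one byte in every page, in order, so the kernel issues
/// // large sequential reads rather than random ones
/// void touchSequentially(const char *data, size_t length, size_t pageSize)
/// {
///     volatile char sink = 0;
///     for (size_t offset = 0; offset < length; offset += pageSize)
///         sink += data[offset];
/// }
/// \endcode
///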
/// The upshot of all of this is that when a single reference is shared,
/// it is more likely that all the pages will be in the disk buffer pool
/// to begin with, thereby reducing startup time to nearly zero. It is
/// also the ideal situation in terms of sharing the same reference
/// among, say, 24 copies of karma on wonderland - the only cost is the
/// hardware page table that gets set up to point to all of the disk
/// buffers.
///
/// As I mentioned a paragraph back, the pages can still get swapped out,
/// even with dozens of karma jobs running. A workaround I created is a
/// program in utilities called mapfile - it simply repeatedly accesses
/// the data in sequential order to help ensure that all of the pages are
/// at the head of the disk buffer pool, and therefore less likely to get
/// swapped out.
///
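/// The idea behind mapfile can be sketched as a loop that keeps
/// re-touching the mapping; posix_madvise() can additionally hint the
/// kernel about the expected access pattern. This is only an
/// illustration of the concept, not the actual mapfile source:
///
/// \code
/// #include <sys/mman.h>
/// #include <unistd.h>
///
/// // keep re-reading the mapping so its pages stay recently used;
/// // 'base' and 'length' would come from an mmap() call as shown above
/// void keepResident(const char *base, size_t length)
/// {
///     posix_madvise((void *) base, length, POSIX_MADV_WILLNEED); // hint only
///     for (;;)
///     {
///         volatile char sink = 0;
///         for (size_t i = 0; i < length; i += 4096)
///             sink += base[i];
///         sleep(60);              // re-touch periodically
///     }
/// }
/// \endcode
///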
/// The benefit of such a program (mapfile) is greater on wonderland,
/// where a lot of processes are competing for memory and disk buffers.
///
///
class MemoryMap
{
#if defined(_WIN32)
    HANDLE file_handle;
    HANDLE map_handle;
    DWORD page_size;
#else
    int fd;
    size_t page_size;
#endif
    off_t offset;
    size_t mapped_length;
    size_t total_length;
    bool useMemoryMapFlag;
public:

    void *data;

    MemoryMap();

    virtual ~MemoryMap();

    void debug_print();

    void constructor_clear();

    void destructor_clear();

    virtual bool allocate();

    /// open a previously created mapped vector
    ///
    /// useMemoryMapFlag will determine whether it
    /// uses mmap() or malloc()/read() to populate
    /// the memory
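    ///
    /// A usage sketch (the file name is hypothetical and error handling
    /// is omitted; see MemoryMap.cpp for the exact return convention):
    ///
    /// \code
    /// MemoryMap map;
    /// // map.useMemoryMap(false); // optional: malloc()/read() instead of mmap()
    /// map.open("reference.umfa"); // hypothetical file; check the return value
    /// map.prefetch();             // optionally force pages into RAM
    /// char firstByte = map[0];    // the mapping is addressable like an array
    /// size_t bytes = map.length();
    /// map.close();
    /// \endcode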
    virtual bool open(const char * file, int flags = O_RDONLY);

    /// create the memory mapped file on disk
    ///
    /// a file will be created on disk with the header
    /// filled in. The caller must now populate elements
    /// using (*this).set(index, value).
    //
    virtual bool create(const char * file, size_t size);

    /// store in allocated memory (malloc), not mmap:
    ///
    /// This is for code that needs to more flexibly handle
    /// the case when an mmap() file _might_ be available,
    /// but if it is not, we want to load the data as a convenience
    /// to the user. GenomeSequence::populateDBSNP does exactly this.
    //
    virtual bool create(size_t size);

    bool close();
    void test();
    size_t length()
    {
        return mapped_length;
    }

    char operator[](unsigned int index)
    {
        return ((char *)data)[index];
    };
    int prefetch();     // force pages into RAM

    //
    // set or unset use of the mmap() call in ::open().
    // This flag must be set before ::open() is called;
    // if it is set afterwards, it has no effect.
    //
    void useMemoryMap(bool flag=true)
    {
        useMemoryMapFlag = flag;
    }
};

#endif