libStatGen Software
MemoryMap.h
/*
 *  Copyright (C) 2010  Regents of the University of Michigan
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

#ifndef __MEMORYMAP_H
#define __MEMORYMAP_H

#include <sys/types.h>
#include <fcntl.h>

#if defined(_WIN32)
#include <windows.h>
#endif

///
/// There are a pair of related data structures in the operating system,
/// and also a few simple algorithms that explain why your processes are
/// waiting forever.
///
/// The symptom you have is that they are getting little or no CPU time,
/// as shown in the command 'top'.  The machine will appear to have
/// available CPU time (look at the Cpu(s): parameter - if less than 100%,
/// you have available CPU).  The real key, however, is to look at the
/// 'top' column with the label 'S' - that is the status of the process,
/// and crucial to understanding what is going on.
///
/// In your instance, the 'S' column for your karma jobs is 'D', which
/// means it is waiting for data.  This is because the process is doing
/// something that is waiting for the filesystem to return data to it.
/// Usually, this is because of a C call like read() or write(), but it
/// also happens in large processes where memory was copied to disk and
/// re-used for other purposes (this is called paging).
///
/// So, a bit of background on the operating system... there is a CPU
/// scheduler that takes a list of waiting processes, and picks one to
/// run - if the job is waiting for the disk, there is no point in picking
/// it to run, since it is blocked, waiting for the disk to return data.
/// The scheduler marks the process with 'D' and moves on to the next
/// process to schedule.
///
/// There are two data structures that we care about for this example.
/// The first is a linear list of disk buffers that are stored in RAM and
/// controlled by the operating system.  This is usually called the disk
/// buffer pool.  Usually, when a program asks for data from the disk,
/// this list can be scanned quickly to see if the data is already in
/// RAM - if so, no disk operation needs to take place.
///
/// Now in the case of the normal Unix read() and write() calls, when the
/// operating system is done finding the page, it copies the data into a
/// buffer to be used by the process that requested it (in the case of a
/// read() - a write() is the opposite).  This copy operation is slow and
/// inefficient, but gets the job done.
///
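/// For comparison, here is a minimal sketch of that copying path at the
/// system-call level (illustrative only; the file name is a placeholder
/// and error checking is omitted):
/// \code
///     #include <fcntl.h>
///     #include <unistd.h>
///
///     char buffer[4096];
///     int fd = ::open("reference.idx", O_RDONLY);      // hypothetical file
///     ssize_t n = ::read(fd, buffer, sizeof(buffer));  // kernel copies bytes from the
///                                                      // disk buffer pool into 'buffer'
///     ::close(fd);
/// \endcode
///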
/// So overall, you gain some efficiency in a large memory system by
/// having this disk buffer pool data structure, since you aren't
/// re-reading the disk over and over to get the same data that you
/// already have in RAM.  However, it is less efficient than it might be
/// because of the extra buffer copying.
///
/// Now we come to memory mapped files, and karma.  The underlying system
/// call of interest to us is mmap(), and it is used in MemoryMap.cpp.
/// What it does and how it works are important to understanding its
/// benefits, and frankly, most people don't care about it because it is
/// seemingly complex.
///
/// Two things are important to know: firstly, there is a data structure
/// in the CPU called the page table, which is mostly contained in the CPU
/// hardware itself.  All memory accesses for normal user processes like
/// karma go through this hardware page table.  Secondly, it is very fast
/// for the operating system to put together a page table that 'connects'
/// a bunch of memory locations in your user program's address space to
/// the disk buffer pool pages.
///
/// The combination of those two facts means that you can implement a
/// 'zero copy' approach to reading data, which means that the data that
/// is in the disk buffer pool is directly readable by the program without
/// the operating system ever having to actually copy the data, like it
/// does for read() or write().
///
/// So the benefit of mmap() is that when the underlying disk pages are
/// already in the disk buffer pool, a hardware data structure gets built,
/// then the program returns, and the data is available at full processor
/// speed with no intervening copy of the data, or waiting for disk or
/// anything else.  It is as near to instantaneous as you can possibly
/// get.  This works whether it is 100 bytes or 100 gigabytes.
///
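/// A minimal sketch of that zero-copy path, roughly what open() does here
/// when memory mapping is enabled (illustrative only; the real
/// implementation is in MemoryMap.cpp, and the file name is a
/// placeholder):
/// \code
///     #include <sys/mman.h>
///     #include <sys/stat.h>
///     #include <fcntl.h>
///     #include <unistd.h>
///
///     int fd = ::open("reference.idx", O_RDONLY);      // hypothetical file
///     struct stat st;
///     fstat(fd, &st);
///     void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
///     if (base != MAP_FAILED)
///     {
///         // the bytes are now readable directly out of the disk buffer
///         // pool - no copy into a private buffer is ever made
///         char first = ((char *) base)[0];
///         (void) first;
///         munmap(base, st.st_size);
///     }
///     ::close(fd);
/// \endcode
///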
/// So, the last part of the puzzle is why your program winds up in 'D'
/// (data wait), and what to do about it.
///
/// The disk buffer pool is a linear list of blocks ordered by the time
/// and date of access.  A process runs every once in a while to take the
/// oldest of those pages, and free them, during which it also has to
/// update the hardware page tables of any processes referencing them.
///
/// So on wonderland, most file access (wget, copy, md5sum, anything else)
/// is constantly putting new fresh pages at the front of the list, and
/// karma index files, having been opened a while ago, are prime
/// candidates for being paged out.  The reason they get paged out, as far
/// as I know, is that in any given second of execution, nowhere near the
/// entire index is getting accessed... so at some point, at least one
/// page gets sent back to disk (well, flushed from RAM).  Once that
/// happens, a cascading effect happens, where the longer it waits, the
/// older the other pages get, then the more that get reclaimed, and the
/// slower it gets, until karma is at a standstill, waiting for pages to
/// be brought back into RAM.
///
/// Now in an ideal world, karma would rapidly recover, and it can...
/// sometimes.  The problem is that your karma job is accessing data all
/// over that index, and it essentially looks like pure random I/O to the
/// underlying filesystem.  There is about a 10 to 1 performance
/// difference between accessing the disk sequentially as compared to
/// randomly.
///
/// So to make karma work better, the first thing I do when starting karma
/// is force it to read all of the disk pages in order.  This causes the
/// entire index to be forced into memory in order, so it is forcing
/// sequential reads, which is the best case possible.  There are
/// problems; for example, if three karma jobs start at once, the disk I/O
/// is no longer as purely sequential as we would like.  Also, if the
/// filesystem is busy taking care of other programs, even if karma thinks
/// it is forcing sequential I/O, the net result looks more random.  This
/// happens when the system is starting to break down (thrashing), and it
/// will certainly stall, or look very very slow, or crash.
///
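/// Conceptually, that warm-up is nothing more than touching one byte in
/// every page, in order - a hedged sketch of the idea behind prefetch()
/// and the mapfile utility, not the literal implementation:
/// \code
///     volatile char sink = 0;
///     for (size_t i = 0; i < mapped_length; i += page_size)
///         sink += ((char *) data)[i];   // fault each page into RAM, sequentially
/// \endcode
///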
/// The upshot of all of this is that when a single reference is shared,
/// it is more likely that all the pages will be in the disk buffer pool
/// to begin with, and thereby reduce startup time to nearly zero.  It is
/// also the ideal situation in terms of sharing the same reference among,
/// say, 24 copies of karma on wonderland - the only cost is the hardware
/// page table that gets set up to point to all of the disk buffers.
///
/// As I mentioned a paragraph back, the pages can still get swapped out,
/// even with dozens of karma jobs running.  A workaround I created is a
/// program in utilities called mapfile - it simply repeatedly accesses
/// the data in sequential order to help ensure that all of the pages are
/// at the head of the disk buffer pool, and therefore less likely to get
/// swapped out.
///
/// The benefit of such a program (mapfile) is greater on wonderland,
/// where a lot of processes are competing for memory and disk buffers.
///
///
class MemoryMap
{
#if defined(_WIN32)
    HANDLE      file_handle;
    HANDLE      map_handle;
    DWORD       page_size;
#else
    int         fd;
    size_t      page_size;
#endif
    off_t       offset;
    size_t      mapped_length;
    size_t      total_length;
    bool        useMemoryMapFlag;
public:

    void *data;

    MemoryMap();

    virtual ~MemoryMap();

    void debug_print();

    void constructor_clear();

    void destructor_clear();

    virtual bool allocate();

    /// open a previously created mapped vector
    ///
    /// useMemoryMapFlag will determine whether it
    /// uses mmap() or malloc()/read() to populate
    /// the memory
    virtual bool open(const char *file, int flags = O_RDONLY);

    /// create the memory mapped file on disk
    ///
    /// a file will be created on disk with the header
    /// filled in.  The caller must now populate elements
    /// using (*this).set(index, value).
    //
    virtual bool create(const char *file, size_t size);

    /// store in allocated memory (malloc), not mmap:
    ///
    /// This is for code that needs to more flexibly handle
    /// the case when an mmap() file _might_ be available,
    /// but if it is not, we want to load it as a convenience
    /// to the user.  GenomeSequence::populateDBSNP does exactly this.
    //
    virtual bool create(size_t size);

    bool close();
    void test();
    size_t length()
    {
        return mapped_length;
    }

    char operator[](unsigned int index)
    {
        return ((char *)data)[index];
    };
    int prefetch();     // force pages into RAM

    //
    // set or unset use of the mmap() call in ::open().
    // This flag must be set before ::open() is called;
    // if it is called afterwards, it has no effect.
    //
    void useMemoryMap(bool flag = true)
    {
        useMemoryMapFlag = flag;
    }
};
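
//
// Example usage - a hedged sketch only; the file name below is a
// placeholder, and error handling is omitted (check the bool return
// values of open()/create() in real code):
//
//     MemoryMap map;
//     map.useMemoryMap(true);        // or false to fall back to malloc()/read()
//     map.open("reference.idx");     // hypothetical index file, opened read-only
//     map.prefetch();                // walk the pages sequentially to warm the cache
//     char firstByte = map[0];       // zero-copy access through operator[]
//     size_t bytes = map.length();   // number of bytes mapped
//     map.close();
//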
#endif