docs/analyzer/DesignDiscussions/InitializerLists.rst

   1 This discussion took place in https://reviews.llvm.org/D35216
   2 "Escape symbols when creating std::initializer_list".
   3
   4 It touches problems of modelling C++ standard library constructs in general,
   5 including modelling implementation-defined fields within C++ standard library
   6 objects, in particular constructing objects into pointers held by such fields,
   7 and separation of responsibilities between analyzer's core and checkers.
   8
   9 **Artem:**
  10
  11 I've seen a few false positives that appear because we construct
  12 C++11 std::initializer_list objects with brace initializers, and such
  13 construction is not properly modeled. For instance, if a new object is
  14 constructed on the heap only to be put into a brace-initialized STL container,
  15 the object is reported to be leaked.
  16
  17 Approach (0): This can be trivially fixed by this patch, which causes pointers
  18 passed into initializer list expressions to immediately escape.
  19
  20 This fix is overly conservative though. So i did a bit of investigation as to
  21 how model std::initializer_list better.
  22
  23 According to the standard, std::initializer_list<T> is an object that has
  24 methods begin(), end(), and size(), where begin() returns a pointer to continuous
  25 array of size() objects of type T, and end() is equal to begin() plus size().
  26 The standard does hint that it should be possible to implement
  27 std::initializer_list<T> as a pair of pointers, or as a pointer and a size
  28 integer, however specific fields that the object would contain are an
  29 implementation detail.
  30
  31 Ideally, we should be able to model the initializer list's methods precisely.
  32 Or, at least, it should be possible to explain to the analyzer that the list
  33 somehow "takes hold" of the values put into it. Initializer lists can also be
  34 copied, which is a separate story that i'm not trying to address here.
  35
  36 The obvious approach to modeling std::initializer_list in a checker would be to
  37 construct a SymbolMetadata for the memory region of the initializer list object,
  38 which would be of type T* and represent begin(), so we'd trivially model begin()
  39 as a function that returns this symbol. The array pointed to by that symbol
  40 would be bindLoc()ed to contain the list's contents (probably as a CompoundVal
  41 to produce less bindings in the store). Extent of this array would represent
  42 size() and would be equal to the length of the list as written.
  43
  44 So this sounds good, however apparently it does nothing to address our false
  45 positives: when the list escapes, our RegionStoreManager is not magically
  46 guessing that the metadata symbol attached to it, together with its contents,
  47 should also escape. In fact, it's impossible to trigger a pointer escape from
  48 within the checker.
  49
  50 Approach (1): If only we enabled ProgramState::bindLoc(..., notifyChanges=true)
  51 to cause pointer escapes (not only region changes) (which sounds like the right
  52 thing to do anyway) such checker would be able to solve the false positives by
  53 triggering escapes when binding list elements to the list. However, it'd be as
  54 conservative as the current patch's solution. Ideally, we do not want escapes to
  55 happen so early. Instead, we'd prefer them to be delayed until the list itself
  56 escapes.
  57
  58 So i believe that escaping metadata symbols whenever their base regions escape
  59 would be the right thing to do. Currently we didn't think about that because we
  60 had neither pointer-type metadatas nor non-pointer escapes.
  61
  62 Approach (2): We could teach the Store to scan itself for bindings to
  63 metadata-symbolic-based regions during scanReachableSymbols() whenever a region
  64 turns out to be reachable. This requires no work on checker side, but it sounds
  65 performance-heavy.
  66
  67 Approach (3): We could let checkers maintain the set of active metadata symbols
  68 in the program state (ideally somewhere in the Store, which sounds weird but
  69 causes the smallest amount of layering violations), so that the core knew what
  70 to escape. This puts a stress on the checkers, but with a smart data map it
  71 wouldn't be a problem.
  72
  73 Approach (4): We could allow checkers to trigger pointer escapes in arbitrary
  74 moments. If we allow doing this within checkPointerEscape callback itself, we
  75 would be able to express facts like "when this region escapes, that metadata
  76 symbol attached to it should also escape". This sounds like an ultimate freedom,
  77 with maximum stress on the checkers - still not too much stress when we have
  78 smart data maps.
  79
  80 I'm personally liking the approach (2) - it should be possible to avoid
  81 performance overhead, and clarity seems nice.
  82
  83 **Gabor:**
  84
  85 At this point, I am a bit wondering about two questions.
  86
  87 - When should something belong to a checker and when should something belong
  88 to the engine? Sometimes we model library aspects in the engine and model
  89 language constructs in checkers.
  90 - What is the checker programming model that we are aiming for? Maximum
  91 freedom or more easy checker development?
  92
  93 I think if we aim for maximum freedom, we do not need to worry about the
  94 potential stress on checkers, and we can introduce abstractions to mitigate that
  95 later on.
  96 If we want to simplify the API, then maybe it makes more sense to move language
  97 construct modeling to the engine when the checker API is not sufficient instead
  98 of complicating the API.
  99
 100 Right now I have no preference or objections between the alternatives but there
 101 are some random thoughts:
 102
 103 - Maybe it would be great to have a guideline how to evolve the analyzer and
 104 follow it, so it can help us to decide in similar situations
 105 - I do care about performance in this case. The reason is that we have a
 106 limited performance budget. And I think we should not expect most of the checker
 107 writers to add modeling of language constructs. So, in my opinion, it is ok to
 108 have less nice/more verbose API for language modeling if we can have better
 109 performance this way, since it only needs to be done once, and is done by the
 110 framework developers.
 111
 112 **Artem:** These are some great questions, i guess it'd be better to discuss
 113 them more openly. As a quick dump of my current mood:
 114
 115 - To me it seems obvious that we need to aim for a checker API that is both
 116 simple and powerful. This can probably by keeping the API as powerful as
 117 necessary while providing a layer of simple ready-made solutions on top of it.
 118 Probably a few reusable components for assembling checkers. And this layer
 119 should ideally be pleasant enough to work with, so that people would prefer to
 120 extend it when something is lacking, instead of falling back to the complex
 121 omnipotent API. I'm thinking of AST matchers vs. AST visitors as a roughly
 122 similar situation: matchers are not omnipotent, but they're so nice.
 123
 124 - Separation between core and checkers is usually quite strange. Once we have
 125 shared state traits, i generally wouldn't mind having region store or range
 126 constraint manager as checkers (though it's probably not worth it to transform
 127 them - just a mood). The main thing to avoid here would be the situation when
 128 the checker overwrites stuff written by the core because it thinks it has a
 129 better idea what's going on, so the core should provide a good default behavior.
 130
 131 - Yeah, i totally care about performance as well, and if i try to implement
 132 approach, i'd make sure it's good.
 133
 134 **Artem:**
 135
 136 > Approach (2): We could teach the Store to scan itself for bindings to
 137 > metadata-symbolic-based regions during scanReachableSymbols() whenever
 138 > a region turns out to be reachable. This requires no work on checker side,
 139 > but it sounds performance-heavy.
 140
 141 Nope, this approach is wrong. Metadata symbols may become out-of-date: when the
 142 object changes, metadata symbols attached to it aren't changing (because symbols
 143 simply don't change). The same metadata may have different symbols to denote its
 144 value in different moments of time, but at most one of them represents the
 145 actual metadata value. So we'd be escaping more stuff than necessary.
 146
 147 If only we had "ghost fields"
 148 (http://lists.llvm.org/pipermail/cfe-dev/2016-May/049000.html), it would have
 149 been much easier, because the ghost field would only contain the actual
 150 metadata, and the Store would always know about it. This example adds to my
 151 belief that ghost fields are exactly what we need for most C++ checkers.
 152
 153 **Devin:**
 154
 155 In this case, I would be fine with some sort of
 156 AbstractStorageMemoryRegion that meant "here is a memory region and somewhere
 157 reachable from here exists another region of type T". Or even multiple regions
 158 with different identifiers. This wouldn't specify how the memory is reachable,
 159 but it would allow for transfer functions to get at those regions and it would
 160 allow for invalidation.
 161
 162 For std::initializer_list this reachable region would the region for the backing
 163 array and the transfer functions for begin() and end() yield the beginning and
 164 end element regions for it.
 165
 166 In my view this differs from ghost variables in that (1) this storage does
 167 actually exist (it is just a library implementation detail where that storage
 168 lives) and (2) it is perfectly valid for a pointer into that storage to be
 169 returned and for another part of the program to read or write from that storage.
 170 (Well, in this case just read since it is allowed to be read-only memory).
 171
 172 What I'm not OK with is modeling abstract analysis state (for example, the count
 173 of a NSMutableArray or the typestate of a file handle) as a value stored in some
 174 ginned up region in the store. This takes an easy problem that the analyzer does
 175 well at (modeling typestate) and turns it into a hard one that the analyzer is
 176 bad at (reasoning about the contents of the heap).
 177
 178 I think the key criterion here is: "is the region accessible from outside the
 179 library". That is, does the library expose the region as a pointer that can be
 180 read to or written from in the client program? If so, then it makes sense for
 181 this to be in the store: we are modeling reachable storage as storage. But if
 182 we're just modeling arbitrary analysis facts that need to be invalidated when a
 183 pointer escapes then we shouldn't try to gin up storage for them just to get
 184 invalidation for free.
 185
 186 **Artem:**
 187
 188 > In this case, I would be fine with some sort of AbstractStorageMemoryRegion
 189 > that meant "here is a memory region and somewhere reachable from here exists
 190 > another region of type T". Or even multiple regions with different
 191 > identifiers. This wouldn't specify how the memory is reachable, but it would
 192 > allow for transfer functions to get at those regions and it would allow for
 193 > invalidation.
 194
 195 Yeah, this is what we can easily implement now as a
 196 symbolic-region-based-on-a-metadata-symbol (though we can make a new region
 197 class for that if we eg. want it typed). The problem is that the relation
 198 between such storage region and its parent object region is essentially
 199 immaterial, similarly to the relation between SymbolRegionValue and its parent
 200 region. Region contents are mutable: today the abstract storage is reachable
 201 from its parent object, tomorrow it's not, and maybe something else becomes
 202 reachable, something that isn't even abstract. So the parent region for the
 203 abstract storage is most of the time at best a "nice to know" thing - we cannot
 204 rely on it to do any actual work. We'd anyway need to rely on the checker to do
 205 the job.
 206
 207 > For std::initializer_list this reachable region would the region for the
 208 > backing array and the transfer functions for begin() and end() yield the
 209 > beginning and end element regions for it.
 210
 211 So maybe in fact for std::initializer_list it may work fine because you cannot
 212 change the data after the object is constructed - so this region's contents are
 213 essentially immutable. For the future, i feel as if it is a dead end.
 214
 215 I'd like to consider another funny example. Suppose we're trying to model
 216 std::unique_ptr. Consider::
 217
 218   void bar(const std::unique_ptr<int> &x);
 219
 220   void foo(std::unique_ptr<int> &x) {
 221     int *a = x.get();   // (a, 0, direct): &AbstractStorageRegion
 222     *a = 1;             // (AbstractStorageRegion, 0, direct): 1 S32b
 223     int *b = new int;
 224     *b = 2;             // (SymRegion{conj_$0<int *>}, 0 ,direct): 2 S32b
 225     x.reset(b);         // Checker map: x -> SymRegion{conj_$0<int *>}
 226     bar(x);             // 'a' doesn't escape (the pointer was unique), 'b' does.
 227     clang_analyzer_eval(*a == 1); // Making this true is up to the checker.
 228     clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker.
 229   }
 230
 231 The checker doesn't totally need to ensure that *a == 1 passes - even though the
 232 pointer was unique, it could theoretically have .get()-ed above and the code
 233 could of course break the uniqueness invariant (though we'd probably want it).
 234 The checker can say that "even if *a did escape, it was not because it was
 235 stuffed directly into bar()".
 236
 237 The checker's direct responsibility, however, is to solve the *b == 2 thing
 238 (which is in fact the problem we're dealing with in this patch - escaping the
 239 storage region of the object).
 240
 241 So we're talking about one more operation over the program state (scanning
 242 reachable symbols and regions) that cannot work without checker support.
 243
 244 We can probably add a new callback "checkReachableSymbols" to solve this. This
 245 is in fact also related to the dead symbols problem (we're scanning for live
 246 symbols in the store and in the checkers separately, but we need to do so
 247 simultaneously with a single worklist). Hmm, in fact this sounds like a good
 248 idea; we can replace checkLiveSymbols with checkReachableSymbols.
 249
 250 Or we could just have ghost member variables, and no checker support required at
 251 all. For ghost member variables, the relation with their parent region (which
 252 would be their superregion) is actually useful, the mutability of their contents
 253 is expressed naturally, and the store automagically sees reachable symbols, live
 254 symbols, escapes, invalidations, whatever.
 255
 256 > In my view this differs from ghost variables in that (1) this storage does
 257 > actually exist (it is just a library implementation detail where that storage
 258 > lives) and (2) it is perfectly valid for a pointer into that storage to be
 259 > returned and for another part of the program to read or write from that
 260 > storage. (Well, in this case just read since it is allowed to be read-only
 261 > memory).
 262
 263 > What I'm not OK with is modeling abstract analysis state (for example, the
 264 > count of a NSMutableArray or the typestate of a file handle) as a value stored
 265 > in some ginned up region in the store.This takes an easy problem that the
 266 > analyzer does well at (modeling typestate) and turns it into a hard one that
 267 > the analyzer is bad at (reasoning about the contents of the heap).
 268
 269 Yeah, i tend to agree on that. For simple typestates, this is probably an
 270 overkill, so let's definitely put aside the idea of "ghost symbolic regions"
 271 that i had earlier.
 272
 273 But, to summarize a bit, in our current case, however, the typestate we're
 274 looking for is the contents of the heap. And when we try to model such
 275 typestates (complex in this specific manner, i.e. heap-like) in any checker, we
 276 have a choice between re-doing this modeling in every such checker (which is
 277 something analyzer is indeed good at, but at a price of making checkers heavy)
 278 or instead relying on the Store to do exactly what it's designed to do.
 279
 280 > I think the key criterion here is: "is the region accessible from outside
 281 > the library". That is, does the library expose the region as a pointer that
 282 > can be read to or written from in the client program? If so, then it makes
 283 > sense for this to be in the store: we are modeling reachable storage as
 284 > storage. But if we're just modeling arbitrary analysis facts that need to be
 285 > invalidated when a pointer escapes then we shouldn't try to gin up storage
 286 > for them just to get invalidation for free.
 287
 288 As a metaphor, i'd probably compare it to body farms - the difference between
 289 ghost member variables and metadata symbols seems to me like the difference
 290 between body farms and evalCall. Both are nice to have, and body farms are very
 291 pleasant to work with, even if not omnipotent. I think it's fine for a
 292 FunctionDecl's body in a body farm to have a local variable, even if such
 293 variable doesn't actually exist, even if it cannot be seen from outside the
 294 function call. I'm not seeing immediate practical difference between "it does
 295 actually exist" and "it doesn't actually exist, just a handy abstraction".
 296 Similarly, i think it's fine if we have a CXXRecordDecl with
 297 implementation-defined contents, and try to farm up a member variable as a handy
 298 abstraction (we don't even need to know its name or offset, only that it's there
 299 somewhere).
 300
 301 **Artem:**
 302
 303 We've discussed it in person with Devin, and he provided more points to think
 304 about:
 305
 306 - If the initializer list consists of non-POD data, constructors of list's
 307 objects need to take the sub-region of the list's region as this-region In the
 308 current (v2) version of this patch, these objects are constructed elsewhere and
 309 then trivial-copied into the list's metadata pointer region, which may be
 310 incorrect. This is our overall problem with C++ constructors, which manifests in
 311 this case as well. Additionally, objects would need to be constructed in the
 312 analyzer's core, which would not be able to predict that it needs to take a
 313 checker-specific region as this-region, which makes it harder, though it might
 314 be mitigated by sharing the checker state traits.
 315
 316 - Because "ghost variables" are not material to the user, we need to somehow
 317 make super sure that they don't make it into the diagnostic messages.
 318
 319 So, because this needs further digging into overall C++ support and rises too
 320 many questions, i'm delaying a better approach to this problem and will fall
 321 back to the original trivial patch.