"""
Binary serialization

NPY format
==========

A simple format for saving numpy arrays to disk with the full
information about them.

The ``.npy`` format is the standard binary file format in NumPy for
persisting a *single* arbitrary NumPy array on disk. The format stores all
of the shape and dtype information necessary to reconstruct the array
correctly even on another machine with a different architecture.
The format is designed to be as simple as possible while achieving
its limited goals.

The ``.npz`` format is the standard format for persisting *multiple* NumPy
arrays on disk. A ``.npz`` file is a zip file containing multiple ``.npy``
files, one for each array.
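
As a quick round-trip illustration of both formats (the file names here are
arbitrary), `numpy.save` writes a single ``.npy`` file and `numpy.savez`
writes a ``.npz`` zip archive:

```python
import os
import tempfile
import zipfile
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "a.npy")
npz_path = os.path.join(tmpdir, "ab.npz")

np.save(npy_path, a)             # single array -> .npy
np.savez(npz_path, a=a, b=b)     # multiple arrays -> .npz

assert np.array_equal(np.load(npy_path), a)
with np.load(npz_path) as data:
    assert np.array_equal(data["a"], a)
    assert np.array_equal(data["b"], b)

# A .npz file really is a zip archive containing one .npy per array.
assert sorted(zipfile.ZipFile(npz_path).namelist()) == ["a.npy", "b.npy"]
```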

Capabilities
------------

- Can represent all NumPy arrays including nested record arrays and
  object arrays.

- Represents the data in its native binary form.

- Supports Fortran-contiguous arrays directly.

- Stores all of the necessary information to reconstruct the array
  including shape and dtype on a machine of a different
  architecture. Both little-endian and big-endian arrays are
  supported, and a file with little-endian numbers will yield
  a little-endian array on any machine reading the file. The
  types are described in terms of their actual sizes. For example,
  if a machine with a 64-bit C "long int" writes out an array with
  "long ints", a reading machine with 32-bit C "long ints" will yield
  an array with 64-bit integers.

- Is straightforward to reverse engineer. Datasets often live longer than
  the programs that created them. A competent developer should be
  able to create a solution in their preferred programming language to
  read most ``.npy`` files that they have been given without much
  documentation.

- Allows memory-mapping of the data. See `open_memmap`.

- Can be read from a filelike stream object instead of an actual file.

- Stores object arrays, i.e. arrays containing elements that are arbitrary
  Python objects. Files with object arrays cannot be memory-mapped, but
  they can be read and written to disk.

Limitations
-----------

- Arbitrary subclasses of numpy.ndarray are not completely preserved.
  Subclasses will be accepted for writing, but only the array data will
  be written out. A regular numpy.ndarray object will be created
  upon reading the file.

.. warning::

  Due to limitations in the interpretation of structured dtypes, dtypes
  with fields with empty names will have the names replaced by 'f0', 'f1',
  etc. Such arrays will not round-trip through the format entirely
  accurately. The data is intact; only the field names will differ. We are
  working on a fix for this. This fix will not require a change in the
  file format. The arrays with such structures can still be saved and
  restored, and the correct dtype may be restored by using the
  ``loadedarray.view(correct_dtype)`` method.
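
A minimal sketch of the ``view`` workaround described above. The field names
``count`` and ``value`` are hypothetical stand-ins for whatever the correct
names are; any dtype with the same itemsize as the loaded array works:

```python
import numpy as np

# Stand-in for an array loaded from disk whose field names came back as
# the default 'f0', 'f1' instead of the intended ones.
loaded = np.zeros(3, dtype=[('f0', '<i4'), ('f1', '<f8')])
correct_dtype = np.dtype([('count', '<i4'), ('value', '<f8')])

# Reinterpret the same bytes under the intended dtype; no data is copied.
fixed = loaded.view(correct_dtype)
assert fixed.dtype.names == ('count', 'value')
assert fixed.dtype.itemsize == loaded.dtype.itemsize
```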

File extensions
---------------

We recommend using the ``.npy`` and ``.npz`` extensions for files saved
in this format. This is by no means a requirement; applications may wish
to use these file formats but use an extension specific to the
application. In the absence of an obvious alternative, however,
we suggest using ``.npy`` and ``.npz``.

Version numbering
-----------------

The version numbering of these formats is independent of NumPy version
numbering. If the format is upgraded, the code in `numpy.io` will still
be able to read and write Version 1.0 files.

Format Version 1.0
------------------

The first 6 bytes are a magic string: exactly ``\\x93NUMPY``.

The next 1 byte is an unsigned byte: the major version number of the file
format, e.g. ``\\x01``.

The next 1 byte is an unsigned byte: the minor version number of the file
format, e.g. ``\\x00``. Note: the version of the file format is not tied
to the version of the numpy package.

The next 2 bytes form a little-endian unsigned short int: the length of
the header data HEADER_LEN.

The next HEADER_LEN bytes form the header data describing the array's
format. It is an ASCII string which contains a Python literal expression
of a dictionary. It is terminated by a newline (``\\n``) and padded with
spaces (``\\x20``) to make the total of
``len(magic string) + 2 + len(length) + HEADER_LEN`` be evenly divisible
by 64 for alignment purposes.

The dictionary contains three keys:

    "descr" : dtype.descr
      An object that can be passed as an argument to the `numpy.dtype`
      constructor to create the array's dtype.
    "fortran_order" : bool
      Whether the array data is Fortran-contiguous or not. Since
      Fortran-contiguous arrays are a common form of non-C-contiguity,
      we allow them to be written directly to disk for efficiency.
    "shape" : tuple of int
      The shape of the array.

For repeatability and readability, the dictionary keys are sorted in
alphabetic order. This is for convenience only. A writer SHOULD implement
this if possible. A reader MUST NOT depend on this.

Following the header comes the array data. If the dtype contains Python
objects (i.e. ``dtype.hasobject is True``), then the data is a Python
pickle of the array. Otherwise the data is the contiguous (either C-
or Fortran-, depending on ``fortran_order``) bytes of the array.
Consumers can figure out the number of bytes by multiplying the number
of elements given by the shape (noting that ``shape=()`` means there is
1 element) by ``dtype.itemsize``.
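
The byte layout described above can be verified by hand. The following
sketch, assuming a recent NumPy (which pads the header so the data start is
aligned to a 64-byte boundary), parses a version 1.0 file produced by
`numpy.save`:

```python
import ast
import io
import struct
import numpy as np

buf = io.BytesIO()
np.save(buf, np.ones((3, 4), dtype='<f8'))
buf.seek(0)

# Magic string and file-format version.
assert buf.read(6) == b'\x93NUMPY'
major, minor = buf.read(1)[0], buf.read(1)[0]
assert (major, minor) == (1, 0)

# Version 1.0: 2-byte little-endian unsigned short header length.
(header_len,) = struct.unpack('<H', buf.read(2))
assert (6 + 2 + 2 + header_len) % 64 == 0   # padded to a 64-byte boundary

# The header is a Python dictionary literal.
header = ast.literal_eval(buf.read(header_len).decode('latin1'))
assert header['shape'] == (3, 4)
assert header['fortran_order'] is False
assert header['descr'] == '<f8'

# The remaining bytes are the raw array data: 12 float64 values.
data = np.frombuffer(buf.read(), dtype=header['descr']).reshape(header['shape'])
assert data.sum() == 12.0
```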

Format Version 2.0
------------------

The version 1.0 format only allowed the array header to have a total size of
65535 bytes. This can be exceeded by structured arrays with a large number of
columns. The version 2.0 format extends the header size to 4 GiB.
`numpy.save` will automatically save in 2.0 format if the data requires it,
else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:
"The next 4 bytes form a little-endian unsigned int: the length of the header
data HEADER_LEN."

Format Version 3.0
------------------

This version replaces the ASCII string (which in practice was latin1) with
a utf8-encoded string, so it supports structured types with arbitrary
unicode field names.
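
For example, a structured dtype whose field name falls outside latin1 cannot
be encoded in a 1.0 or 2.0 header, so NumPy (>= 1.17) transparently falls
back to format 3.0; a sketch:

```python
import io
import warnings
import numpy as np

# A field name outside latin1 forces the utf8 (3.0) header.
arr = np.array([(1.5,), (2.5,)], dtype=[('Δt', '<f4')])

buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # NumPy warns that 3.0 needs NumPy >= 1.17
    np.save(buf, arr)

# Magic string followed by version bytes \x03\x00.
assert buf.getvalue()[:8] == b'\x93NUMPY\x03\x00'

buf.seek(0)
loaded = np.load(buf)
assert loaded.dtype.names == ('Δt',)
assert float(loaded['Δt'].sum()) == 4.0
```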

155 

156Notes 

157----- 

158The ``.npy`` format, including motivation for creating it and a comparison of 

159alternatives, is described in the `"npy-format" NEP 

160<https://www.numpy.org/neps/nep-0001-npy-format.html>`_, however details have 

161evolved with time and this document is more current. 

162 

163""" 

import numpy
import io
import warnings
from numpy.lib.utils import safe_eval
from numpy.compat import (
    isfileobj, os_fspath, pickle
    )


__all__ = []


MAGIC_PREFIX = b'\x93NUMPY'
MAGIC_LEN = len(MAGIC_PREFIX) + 2
ARRAY_ALIGN = 64  # plausible values are powers of 2 between 16 and 4096
BUFFER_SIZE = 2**18  # size of buffer for reading npz files in bytes

# difference between version 1.0 and 2.0 is a 4 byte (I) header length
# instead of 2 bytes (H) allowing storage of large structured arrays
_header_size_info = {
    (1, 0): ('<H', 'latin1'),
    (2, 0): ('<I', 'latin1'),
    (3, 0): ('<I', 'utf8'),
}


def _check_version(version):
    if version not in [(1, 0), (2, 0), (3, 0), None]:
        msg = "we only support format version (1,0), (2,0), and (3,0), not %s"
        raise ValueError(msg % (version,))


def magic(major, minor):
    """ Return the magic string for the given file format version.

    Parameters
    ----------
    major : int in [0, 255]
    minor : int in [0, 255]

    Returns
    -------
    magic : str

    Raises
    ------
    ValueError if the version cannot be formatted.
    """
    if major < 0 or major > 255:
        raise ValueError("major version must be 0 <= major < 256")
    if minor < 0 or minor > 255:
        raise ValueError("minor version must be 0 <= minor < 256")
    return MAGIC_PREFIX + bytes([major, minor])

def read_magic(fp):
    """ Read the magic string to get the version of the file format.

    Parameters
    ----------
    fp : filelike object

    Returns
    -------
    major : int
    minor : int
    """
    magic_str = _read_bytes(fp, MAGIC_LEN, "magic string")
    if magic_str[:-2] != MAGIC_PREFIX:
        msg = "the magic string is not correct; expected %r, got %r"
        raise ValueError(msg % (MAGIC_PREFIX, magic_str[:-2]))
    major, minor = magic_str[-2:]
    return major, minor


def _has_metadata(dt):
    if dt.metadata is not None:
        return True
    elif dt.names is not None:
        return any(_has_metadata(dt[k]) for k in dt.names)
    elif dt.subdtype is not None:
        return _has_metadata(dt.base)
    else:
        return False

def dtype_to_descr(dtype):
    """
    Get a serializable descriptor from the dtype.

    The .descr attribute of a dtype object cannot be round-tripped through
    the dtype() constructor. Simple types, like dtype('float32'), have
    a descr which looks like a record array with one field with '' as
    a name. The dtype() constructor interprets this as a request to give
    a default name. Instead, we construct a descriptor that can be passed
    to dtype().

    Parameters
    ----------
    dtype : dtype
        The dtype of the array that will be written to disk.

    Returns
    -------
    descr : object
        An object that can be passed to `numpy.dtype()` in order to
        replicate the input dtype.

    """
    if _has_metadata(dtype):
        warnings.warn("metadata on a dtype may be saved or ignored, but will "
                      "raise if saved when read. Use another form of storage.",
                      UserWarning, stacklevel=2)
    if dtype.names is not None:
        # This is a record array. The .descr is fine. XXX: parts of the
        # record array with an empty name, like padding bytes, still get
        # fiddled with. This needs to be fixed in the C implementation of
        # dtype().
        return dtype.descr
    else:
        return dtype.str

def descr_to_dtype(descr):
    '''
    descr may be stored as dtype.descr, which is a list of
    (name, format, [shape]) tuples where format may be a str or a tuple.
    Offsets are not explicitly saved; rather, empty fields with
    name, format == '', '|Vn' are added as padding.

    This function reverses the process, eliminating the empty padding fields.
    '''
    if isinstance(descr, str):
        # No padding removal needed
        return numpy.dtype(descr)
    elif isinstance(descr, tuple):
        # subtype, will always have a shape descr[1]
        dt = descr_to_dtype(descr[0])
        return numpy.dtype((dt, descr[1]))

    titles = []
    names = []
    formats = []
    offsets = []
    offset = 0
    for field in descr:
        if len(field) == 2:
            name, descr_str = field
            dt = descr_to_dtype(descr_str)
        else:
            name, descr_str, shape = field
            dt = numpy.dtype((descr_to_dtype(descr_str), shape))

        # Ignore padding bytes, which will be void bytes with '' as name.
        # Once support for blank names is removed, only "if name == ''" is
        # needed.
        is_pad = (name == '' and dt.type is numpy.void and dt.names is None)
        if not is_pad:
            title, name = name if isinstance(name, tuple) else (None, name)
            titles.append(title)
            names.append(name)
            formats.append(dt)
            offsets.append(offset)
        offset += dt.itemsize

    return numpy.dtype({'names': names, 'formats': formats, 'titles': titles,
                        'offsets': offsets, 'itemsize': offset})

def header_data_from_array_1_0(array):
    """ Get the dictionary of header metadata from a numpy.ndarray.

    Parameters
    ----------
    array : numpy.ndarray

    Returns
    -------
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    """
    d = {'shape': array.shape}
    if array.flags.c_contiguous:
        d['fortran_order'] = False
    elif array.flags.f_contiguous:
        d['fortran_order'] = True
    else:
        # Totally non-contiguous data. We will have to make it C-contiguous
        # before writing. Note that we need to test for C_CONTIGUOUS first
        # because a 1-D array is both C_CONTIGUOUS and F_CONTIGUOUS.
        d['fortran_order'] = False

    d['descr'] = dtype_to_descr(array.dtype)
    return d

def _wrap_header(header, version):
    """
    Takes a stringified header, and attaches the prefix and padding to it
    """
    import struct
    assert version is not None
    fmt, encoding = _header_size_info[version]
    if not isinstance(header, bytes):  # always true on python 3
        header = header.encode(encoding)
    hlen = len(header) + 1
    padlen = ARRAY_ALIGN - ((MAGIC_LEN + struct.calcsize(fmt) + hlen) % ARRAY_ALIGN)
    try:
        header_prefix = magic(*version) + struct.pack(fmt, hlen + padlen)
    except struct.error:
        msg = "Header length {} too big for version={}".format(hlen, version)
        raise ValueError(msg)

    # Pad the header with spaces and a final newline such that the magic
    # string, the header-length short and the header are aligned on an
    # ARRAY_ALIGN byte boundary. This supports memory mapping of dtypes
    # aligned up to ARRAY_ALIGN on systems like Linux where mmap()
    # offset must be page-aligned (i.e. the beginning of the file).
    return header_prefix + header + b' '*padlen + b'\n'


def _wrap_header_guess_version(header):
    """
    Like `_wrap_header`, but chooses an appropriate version given the contents
    """
    try:
        return _wrap_header(header, (1, 0))
    except ValueError:
        pass

    try:
        ret = _wrap_header(header, (2, 0))
    except UnicodeEncodeError:
        pass
    else:
        warnings.warn("Stored array in format 2.0. It can only be "
                      "read by NumPy >= 1.9", UserWarning, stacklevel=2)
        return ret

    header = _wrap_header(header, (3, 0))
    warnings.warn("Stored array in format 3.0. It can only be "
                  "read by NumPy >= 1.17", UserWarning, stacklevel=2)
    return header


def _write_array_header(fp, d, version=None):
    """ Write the header for an array.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string representation
        to the header of the file.
    version : tuple or None
        None means use the oldest version that works; an explicit version
        will raise a ValueError if the format does not allow saving this
        data. Default: None
    """
    header = ["{"]
    for key, value in sorted(d.items()):
        # Need to use repr here, since we eval these when reading
        header.append("'%s': %s, " % (key, repr(value)))
    header.append("}")
    header = "".join(header)
    header = _filter_header(header)
    if version is None:
        header = _wrap_header_guess_version(header)
    else:
        header = _wrap_header(header, version)
    fp.write(header)

def write_array_header_1_0(fp, d):
    """ Write the header for an array using the 1.0 format.

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (1, 0))


def write_array_header_2_0(fp, d):
    """ Write the header for an array using the 2.0 format.
    The 2.0 format allows storing very large structured arrays.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
    d : dict
        This has the appropriate entries for writing its string
        representation to the header of the file.
    """
    _write_array_header(fp, d, (2, 0))

def read_array_header_1_0(fp):
    """
    Read an array header from a filelike object using the 1.0 file format
    version.

    This will leave the file object located just after the header.

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        Whether the array data is Fortran-contiguous or not.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(fp, version=(1, 0))


def read_array_header_2_0(fp):
    """
    Read an array header from a filelike object using the 2.0 file format
    version.

    This will leave the file object located just after the header.

    .. versionadded:: 1.9.0

    Parameters
    ----------
    fp : filelike object
        A file object or something with a `.read()` method like a file.

    Returns
    -------
    shape : tuple of int
        The shape of the array.
    fortran_order : bool
        Whether the array data is Fortran-contiguous or not.
    dtype : dtype
        The dtype of the file's data.

    Raises
    ------
    ValueError
        If the data is invalid.

    """
    return _read_array_header(fp, version=(2, 0))

def _filter_header(s):
    """Clean up 'L' in npy header ints.

    Cleans up the 'L' suffix in strings representing integers. Needed to
    allow npy headers produced in Python 2 to be read in Python 3.

    Parameters
    ----------
    s : string
        Npy file header.

    Returns
    -------
    header : str
        Cleaned up header.

    """
    import tokenize
    from io import StringIO

    tokens = []
    last_token_was_number = False
    for token in tokenize.generate_tokens(StringIO(s).readline):
        token_type = token[0]
        token_string = token[1]
        if (last_token_was_number and
                token_type == tokenize.NAME and
                token_string == "L"):
            continue
        else:
            tokens.append(token)
        last_token_was_number = (token_type == tokenize.NUMBER)
    return tokenize.untokenize(tokens)

def _read_array_header(fp, version):
    """
    see read_array_header_1_0
    """
    # Read an unsigned, little-endian short int which has the length of the
    # header.
    import struct
    hinfo = _header_size_info.get(version)
    if hinfo is None:
        raise ValueError("Invalid version {!r}".format(version))
    hlength_type, encoding = hinfo

    hlength_str = _read_bytes(fp, struct.calcsize(hlength_type), "array header length")
    header_length = struct.unpack(hlength_type, hlength_str)[0]
    header = _read_bytes(fp, header_length, "array header")
    header = header.decode(encoding)

    # The header is a pretty-printed string representation of a literal
    # Python dictionary with trailing newlines padded to an ARRAY_ALIGN byte
    # boundary. The keys are strings.
    #   "shape" : tuple of int
    #   "fortran_order" : bool
    #   "descr" : dtype.descr
    header = _filter_header(header)
    try:
        d = safe_eval(header)
    except SyntaxError as e:
        msg = "Cannot parse header: {!r}\nException: {!r}"
        raise ValueError(msg.format(header, e))
    if not isinstance(d, dict):
        msg = "Header is not a dictionary: {!r}"
        raise ValueError(msg.format(d))
    keys = sorted(d.keys())
    if keys != ['descr', 'fortran_order', 'shape']:
        msg = "Header does not contain the correct keys: {!r}"
        raise ValueError(msg.format(keys))

    # Sanity-check the values.
    if (not isinstance(d['shape'], tuple) or
            not numpy.all([isinstance(x, int) for x in d['shape']])):
        msg = "shape is not valid: {!r}"
        raise ValueError(msg.format(d['shape']))
    if not isinstance(d['fortran_order'], bool):
        msg = "fortran_order is not a valid bool: {!r}"
        raise ValueError(msg.format(d['fortran_order']))
    try:
        dtype = descr_to_dtype(d['descr'])
    except TypeError:
        msg = "descr is not a valid dtype descriptor: {!r}"
        raise ValueError(msg.format(d['descr']))

    return d['shape'], d['fortran_order'], dtype

def write_array(fp, array, version=None, allow_pickle=True, pickle_kwargs=None):
    """
    Write an array to an NPY file, including a header.

    If the array is neither C-contiguous nor Fortran-contiguous AND the
    file_like object is not a real file object, this function will have to
    copy data in memory.

    Parameters
    ----------
    fp : file_like object
        An open, writable file object, or similar object with a
        ``.write()`` method.
    array : ndarray
        The array to write to disk.
    version : (int, int) or None, optional
        The version number of the format. None means use the oldest
        supported version that is able to store the data. Default: None
    allow_pickle : bool, optional
        Whether to allow writing pickled data. Default: True
    pickle_kwargs : dict, optional
        Additional keyword arguments to pass to pickle.dump, excluding
        'protocol'. These are only useful when pickling objects in object
        arrays on Python 3 to a Python 2 compatible format.

    Raises
    ------
    ValueError
        If the array cannot be persisted. This includes the case of
        allow_pickle=False and array being an object array.
    Various other errors
        If the array contains Python objects as part of its dtype, the
        process of pickling them may raise various errors if the objects
        are not picklable.

    """
    _check_version(version)
    _write_array_header(fp, header_data_from_array_1_0(array), version)

    if array.itemsize == 0:
        buffersize = 0
    else:
        # Set buffer size to 16 MiB to hide the Python loop overhead.
        buffersize = max(16 * 1024 ** 2 // array.itemsize, 1)

    if array.dtype.hasobject:
        # We contain Python objects so we cannot write out the data
        # directly. Instead, we will pickle it out.
        if not allow_pickle:
            raise ValueError("Object arrays cannot be saved when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        pickle.dump(array, fp, protocol=3, **pickle_kwargs)
    elif array.flags.f_contiguous and not array.flags.c_contiguous:
        if isfileobj(fp):
            array.T.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='F'):
                fp.write(chunk.tobytes('C'))
    else:
        if isfileobj(fp):
            array.tofile(fp)
        else:
            for chunk in numpy.nditer(
                    array, flags=['external_loop', 'buffered', 'zerosize_ok'],
                    buffersize=buffersize, order='C'):
                fp.write(chunk.tobytes('C'))

def read_array(fp, allow_pickle=False, pickle_kwargs=None):
    """
    Read an array from an NPY file.

    Parameters
    ----------
    fp : file_like object
        If this is not a real file object, then this may take extra memory
        and time.
    allow_pickle : bool, optional
        Whether to allow reading pickled data. Default: False

        .. versionchanged:: 1.16.3
            Made default False in response to CVE-2019-6446.

    pickle_kwargs : dict
        Additional keyword arguments to pass to pickle.load. These are only
        useful when loading object arrays saved on Python 2 when using
        Python 3.

    Returns
    -------
    array : ndarray
        The array from the data on disk.

    Raises
    ------
    ValueError
        If the data is invalid, or allow_pickle=False and the file contains
        an object array.

    """
    version = read_magic(fp)
    _check_version(version)
    shape, fortran_order, dtype = _read_array_header(fp, version)
    if len(shape) == 0:
        count = 1
    else:
        count = numpy.multiply.reduce(shape, dtype=numpy.int64)

    # Now read the actual data.
    if dtype.hasobject:
        # The array contained Python objects. We need to unpickle the data.
        if not allow_pickle:
            raise ValueError("Object arrays cannot be loaded when "
                             "allow_pickle=False")
        if pickle_kwargs is None:
            pickle_kwargs = {}
        try:
            array = pickle.load(fp, **pickle_kwargs)
        except UnicodeError as err:
            # Friendlier error message
            raise UnicodeError("Unpickling a python object failed: %r\n"
                               "You may need to pass the encoding= option "
                               "to numpy.load" % (err,))
    else:
        if isfileobj(fp):
            # We can use the fast fromfile() function.
            array = numpy.fromfile(fp, dtype=dtype, count=count)
        else:
            # This is not a real file. We have to read it the
            # memory-intensive way.
            # crc32 module fails on reads greater than 2 ** 32 bytes,
            # breaking large reads from gzip streams. Chunk reads to
            # BUFFER_SIZE bytes to avoid issue and reduce memory overhead
            # of the read. In non-chunked case count < max_read_count, so
            # only one read is performed.

            # Use np.ndarray instead of np.empty since the latter does
            # not correctly instantiate zero-width string dtypes; see
            # https://github.com/numpy/numpy/pull/6430
            array = numpy.ndarray(count, dtype=dtype)

            if dtype.itemsize > 0:
                # If dtype.itemsize == 0 then there's nothing more to read
                max_read_count = BUFFER_SIZE // min(BUFFER_SIZE, dtype.itemsize)

                for i in range(0, count, max_read_count):
                    read_count = min(max_read_count, count - i)
                    read_size = int(read_count * dtype.itemsize)
                    data = _read_bytes(fp, read_size, "array data")
                    array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
                                                             count=read_count)

        if fortran_order:
            array.shape = shape[::-1]
            array = array.transpose()
        else:
            array.shape = shape

    return array

def open_memmap(filename, mode='r+', dtype=None, shape=None,
                fortran_order=False, version=None):
    """
    Open a .npy file as a memory-mapped array.

    This may be used to read an existing file or create a new one.

    Parameters
    ----------
    filename : str or path-like
        The name of the file on disk. This may *not* be a file-like
        object.
    mode : str, optional
        The mode in which to open the file; the default is 'r+'. In
        addition to the standard file modes, 'c' is also accepted to mean
        "copy on write." See `memmap` for the available mode strings.
    dtype : data-type, optional
        The data type of the array if we are creating a new file in "write"
        mode, if not, `dtype` is ignored. The default value is None, which
        results in a data-type of `float64`.
    shape : tuple of int
        The shape of the array if we are creating a new file in "write"
        mode, in which case this parameter is required. Otherwise, this
        parameter is ignored and is thus optional.
    fortran_order : bool, optional
        Whether the array should be Fortran-contiguous (True) or
        C-contiguous (False, the default) if we are creating a new file in
        "write" mode.
    version : tuple of int (major, minor) or None
        If the mode is a "write" mode, then this is the version of the file
        format used to create the file. None means use the oldest
        supported version that is able to store the data. Default: None

    Returns
    -------
    marray : memmap
        The memory-mapped array.

    Raises
    ------
    ValueError
        If the data or the mode is invalid.
    IOError
        If the file is not found or cannot be opened correctly.

    See Also
    --------
    memmap

    """
    if isfileobj(filename):
        raise ValueError("Filename must be a string or a path-like object."
                         " Memmap cannot use existing file handles.")

    if 'w' in mode:
        # We are creating the file, not reading it.
        # Check if we ought to create the file.
        _check_version(version)
        # Ensure that the given dtype is an authentic dtype object rather
        # than just something that can be interpreted as a dtype object.
        dtype = numpy.dtype(dtype)
        if dtype.hasobject:
            msg = "Array can't be memory-mapped: Python objects in dtype."
            raise ValueError(msg)
        d = dict(
            descr=dtype_to_descr(dtype),
            fortran_order=fortran_order,
            shape=shape,
        )
        # If we got here, then it should be safe to create the file.
        with open(os_fspath(filename), mode+'b') as fp:
            _write_array_header(fp, d, version)
            offset = fp.tell()
    else:
        # Read the header of the file first.
        with open(os_fspath(filename), 'rb') as fp:
            version = read_magic(fp)
            _check_version(version)

            shape, fortran_order, dtype = _read_array_header(fp, version)
            if dtype.hasobject:
                msg = "Array can't be memory-mapped: Python objects in dtype."
                raise ValueError(msg)
            offset = fp.tell()

    if fortran_order:
        order = 'F'
    else:
        order = 'C'

    # We need to change a write-only mode to a read-write mode since we've
    # already written data to the file.
    if mode == 'w+':
        mode = 'r+'

    marray = numpy.memmap(filename, dtype=dtype, shape=shape, order=order,
                          mode=mode, offset=offset)

    return marray

def _read_bytes(fp, size, error_template="ran out of data"):
    """
    Read from file-like object until size bytes are read.
    Raises ValueError if EOF is encountered before size bytes are read.
    Non-blocking objects are only supported if they derive from io objects.

    Required as e.g. ZipExtFile in python 2.6 can return less data than
    requested.
    """
    data = bytes()
    while True:
        # io files (default in python3) return None or raise on
        # would-block; python2 file will truncate, and probably nothing can
        # be done about that. Note that regular files can't be non-blocking.
        try:
            r = fp.read(size - len(data))
            data += r
            if len(r) == 0 or len(data) == size:
                break
        except io.BlockingIOError:
            pass
    if len(data) != size:
        msg = "EOF: reading %s, expected %d bytes got %d"
        raise ValueError(msg % (error_template, size, len(data)))
    else:
        return data