diff options
| author | Daniel Vetter <daniel.vetter@ffwll.ch> | 2012-06-25 19:06:12 +0200 |
|---|---|---|
| committer | Daniel Vetter <daniel.vetter@ffwll.ch> | 2012-06-25 19:10:36 +0200 |
| commit | 7b0cfee1a24efdfe0235bac62e53f686fe8a8e24 (patch) | |
| tree | eeeb8cc3bf7be5ec0e54b7c4f3808ef88ecca012 /Documentation | |
| parent | 9756fe38d10b2bf90c81dc4d2f17d5632e135364 (diff) | |
| parent | 6b16351acbd415e66ba16bf7d473ece1574cf0bc (diff) | |
Merge tag 'v3.5-rc4' into drm-intel-next-queued
I want to merge the "no more fake agp on gen6+" patches into
drm-intel-next (well, the last pieces). But a patch in 3.5-rc4 also
adds a new use of dev->agp. Hence the backmarge to sort this out, for
otherwise drm-intel-next merged into Linus' tree would conflict in the
relevant code, things would compile but nicely OOPS at driver load :(
Conflicts in this merge are just simple cases of "both branches
changed/added lines at the same place". The only tricky part is to
keep the order correct wrt the unwind code in case of errors in
intel_ringbuffer.c (and the MI_DISPLAY_FLIP #defines in i915_reg.h
together, obviously).
Conflicts:
drivers/gpu/drm/i915/i915_reg.h
drivers/gpu/drm/i915/intel_ringbuffer.c
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Diffstat (limited to 'Documentation')
26 files changed, 870 insertions, 31 deletions
diff --git a/Documentation/ABI/testing/sysfs-block-rssd b/Documentation/ABI/testing/sysfs-block-rssd index d535757799fe..679ce3543122 100644 --- a/Documentation/ABI/testing/sysfs-block-rssd +++ b/Documentation/ABI/testing/sysfs-block-rssd @@ -6,13 +6,21 @@ Description: This is a read-only file. Dumps below driver information and hardware registers. - S ACTive - Command Issue - - Allocated - Completed - PORT IRQ STAT - HOST IRQ STAT + - Allocated + - Commands in Q What: /sys/block/rssd*/status Date: April 2012 KernelVersion: 3.4 Contact: Asai Thambi S P <asamymuthupa@micron.com> -Description: This is a read-only file. Indicates the status of the device. +Description: This is a read-only file. Indicates the status of the device. + +What: /sys/block/rssd*/flags +Date: May 2012 +KernelVersion: 3.5 +Contact: Asai Thambi S P <asamymuthupa@micron.com> +Description: This is a read-only file. Dumps the flags in port and driver + data structure diff --git a/Documentation/ABI/testing/sysfs-bus-fcoe b/Documentation/ABI/testing/sysfs-bus-fcoe new file mode 100644 index 000000000000..469d09c02f6b --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-fcoe @@ -0,0 +1,77 @@ +What: /sys/bus/fcoe/ctlr_X +Date: March 2012 +KernelVersion: TBD +Contact: Robert Love <robert.w.love@intel.com>, devel@open-fcoe.org +Description: 'FCoE Controller' instances on the fcoe bus +Attributes: + + fcf_dev_loss_tmo: Device loss timeout peroid (see below). Changing + this value will change the dev_loss_tmo for all + FCFs discovered by this controller. + + lesb_link_fail: Link Error Status Block (LESB) link failure count. + + lesb_vlink_fail: Link Error Status Block (LESB) virtual link + failure count. + + lesb_miss_fka: Link Error Status Block (LESB) missed FCoE + Initialization Protocol (FIP) Keep-Alives (FKA). + + lesb_symb_err: Link Error Status Block (LESB) symbolic error count. + + lesb_err_block: Link Error Status Block (LESB) block error count. + + lesb_fcs_error: Link Error Status Block (LESB) Fibre Channel + Serivces error count. + +Notes: ctlr_X (global increment starting at 0) + +What: /sys/bus/fcoe/fcf_X +Date: March 2012 +KernelVersion: TBD +Contact: Robert Love <robert.w.love@intel.com>, devel@open-fcoe.org +Description: 'FCoE FCF' instances on the fcoe bus. A FCF is a Fibre Channel + Forwarder, which is a FCoE switch that can accept FCoE + (Ethernet) packets, unpack them, and forward the embedded + Fibre Channel frames into a FC fabric. It can also take + outbound FC frames and pack them in Ethernet packets to + be sent to their destination on the Ethernet segment. +Attributes: + + fabric_name: Identifies the fabric that the FCF services. + + switch_name: Identifies the FCF. + + priority: The switch's priority amongst other FCFs on the same + fabric. + + selected: 1 indicates that the switch has been selected for use; + 0 indicates that the swich will not be used. + + fc_map: The Fibre Channel MAP + + vfid: The Virtual Fabric ID + + mac: The FCF's MAC address + + fka_peroid: The FIP Keep-Alive peroid + + fabric_state: The internal kernel state + "Unknown" - Initialization value + "Disconnected" - No link to the FCF/fabric + "Connected" - Host is connected to the FCF + "Deleted" - FCF is being removed from the system + + dev_loss_tmo: The device loss timeout peroid for this FCF. + +Notes: A device loss infrastructre similar to the FC Transport's + is present in fcoe_sysfs. It is nice to have so that a + link flapping adapter doesn't continually advance the count + used to identify the discovered FCF. FCFs will exist in a + "Disconnected" state until either the timer expires and the + FCF becomes "Deleted" or the FCF is rediscovered and becomes + "Connected." + + +Users: The first user of this interface will be the fcoeadm application, + which is commonly packaged in the fcoe-utils package. diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio index 5bc8a476c15e..cfedf63cce15 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio +++ b/Documentation/ABI/testing/sysfs-bus-iio @@ -219,6 +219,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_scale What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_supply_scale What: /sys/bus/iio/devices/iio:deviceX/in_voltage_scale What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_scale +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_scale What: /sys/bus/iio/devices/iio:deviceX/in_accel_scale What: /sys/bus/iio/devices/iio:deviceX/in_accel_peak_scale What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_scale @@ -273,6 +274,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_accel_scale_available What: /sys/.../iio:deviceX/in_voltageX_scale_available What: /sys/.../iio:deviceX/in_voltage-voltage_scale_available What: /sys/.../iio:deviceX/out_voltageX_scale_available +What: /sys/.../iio:deviceX/out_altvoltageX_scale_available What: /sys/.../iio:deviceX/in_capacitance_scale_available KernelVersion: 2.635 Contact: linux-iio@vger.kernel.org @@ -298,14 +300,19 @@ Description: gives the 3dB frequency of the filter in Hz. What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_raw +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_raw KernelVersion: 2.6.37 Contact: linux-iio@vger.kernel.org Description: Raw (unscaled, no bias etc.) output voltage for channel Y. The number must always be specified and unique if the output corresponds to a single channel. + While DAC like devices typically use out_voltage, + a continuous frequency generating device, such as + a DDS or PLL should use out_altvoltage. What: /sys/bus/iio/devices/iio:deviceX/out_voltageY&Z_raw +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY&Z_raw KernelVersion: 2.6.37 Contact: linux-iio@vger.kernel.org Description: @@ -316,6 +323,8 @@ Description: What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_powerdown_mode What: /sys/bus/iio/devices/iio:deviceX/out_voltage_powerdown_mode +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_powerdown_mode +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltage_powerdown_mode KernelVersion: 2.6.38 Contact: linux-iio@vger.kernel.org Description: @@ -330,6 +339,8 @@ Description: What: /sys/.../iio:deviceX/out_votlageY_powerdown_mode_available What: /sys/.../iio:deviceX/out_voltage_powerdown_mode_available +What: /sys/.../iio:deviceX/out_altvotlageY_powerdown_mode_available +What: /sys/.../iio:deviceX/out_altvoltage_powerdown_mode_available KernelVersion: 2.6.38 Contact: linux-iio@vger.kernel.org Description: @@ -338,6 +349,8 @@ Description: What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_powerdown What: /sys/bus/iio/devices/iio:deviceX/out_voltage_powerdown +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_powerdown +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltage_powerdown KernelVersion: 2.6.38 Contact: linux-iio@vger.kernel.org Description: @@ -346,6 +359,24 @@ Description: normal operation. Y may be suppressed if all outputs are controlled together. +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency +KernelVersion: 3.4.0 +Contact: linux-iio@vger.kernel.org +Description: + Output frequency for channel Y in Hz. The number must always be + specified and unique if the output corresponds to a single + channel. + +What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_phase +KernelVersion: 3.4.0 +Contact: linux-iio@vger.kernel.org +Description: + Phase in radians of one frequency/clock output Y + (out_altvoltageY) relative to another frequency/clock output + (out_altvoltageZ) of the device X. The number must always be + specified and unique if the output corresponds to a single + channel. + What: /sys/bus/iio/devices/iio:deviceX/events KernelVersion: 2.6.35 Contact: linux-iio@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-class-mtd b/Documentation/ABI/testing/sysfs-class-mtd index 4d55a1888981..db1ad7e34fc3 100644 --- a/Documentation/ABI/testing/sysfs-class-mtd +++ b/Documentation/ABI/testing/sysfs-class-mtd @@ -123,3 +123,54 @@ Description: half page, or a quarter page). In the case of ECC NOR, it is the ECC block size. + +What: /sys/class/mtd/mtdX/ecc_strength +Date: April 2012 +KernelVersion: 3.4 +Contact: linux-mtd@lists.infradead.org +Description: + Maximum number of bit errors that the device is capable of + correcting within each region covering an ecc step. This will + always be a non-negative integer. Note that some devices will + have multiple ecc steps within each writesize region. + + In the case of devices lacking any ECC capability, it is 0. + +What: /sys/class/mtd/mtdX/bitflip_threshold +Date: April 2012 +KernelVersion: 3.4 +Contact: linux-mtd@lists.infradead.org +Description: + This allows the user to examine and adjust the criteria by which + mtd returns -EUCLEAN from mtd_read(). If the maximum number of + bit errors that were corrected on any single region comprising + an ecc step (as reported by the driver) equals or exceeds this + value, -EUCLEAN is returned. Otherwise, absent an error, 0 is + returned. Higher layers (e.g., UBI) use this return code as an + indication that an erase block may be degrading and should be + scrutinized as a candidate for being marked as bad. + + The initial value may be specified by the flash device driver. + If not, then the default value is ecc_strength. + + The introduction of this feature brings a subtle change to the + meaning of the -EUCLEAN return code. Previously, it was + interpreted to mean simply "one or more bit errors were + corrected". Its new interpretation can be phrased as "a + dangerously high number of bit errors were corrected on one or + more regions comprising an ecc step". The precise definition of + "dangerously high" can be adjusted by the user with + bitflip_threshold. Users are discouraged from doing this, + however, unless they know what they are doing and have intimate + knowledge of the properties of their device. Broadly speaking, + bitflip_threshold should be low enough to detect genuine erase + block degradation, but high enough to avoid the consequences of + a persistent return value of -EUCLEAN on devices where sticky + bitflips occur. Note that if bitflip_threshold exceeds + ecc_strength, -EUCLEAN is never returned by mtd_read(). + Conversely, if bitflip_threshold is zero, -EUCLEAN is always + returned, absent a hard error. + + This is generally applicable only to NAND flash devices with ECC + capability. It is ignored on devices lacking ECC capability; + i.e., devices for which ecc_strength is zero. diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index c58b236bbe04..cb9258b8fd35 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -671,8 +671,9 @@ ones already enabled by DEBUG. Chapter 14: Allocating memory The kernel provides the following general purpose memory allocators: -kmalloc(), kzalloc(), kcalloc(), vmalloc(), and vzalloc(). Please refer to -the API documentation for further information about them. +kmalloc(), kzalloc(), kmalloc_array(), kcalloc(), vmalloc(), and +vzalloc(). Please refer to the API documentation for further information +about them. The preferred form for passing a size of a struct is the following: @@ -686,6 +687,17 @@ Casting the return value which is a void pointer is redundant. The conversion from void pointer to any other pointer type is guaranteed by the C programming language. +The preferred form for allocating an array is the following: + + p = kmalloc_array(n, sizeof(...), ...); + +The preferred form for allocating a zeroed array is the following: + + p = kcalloc(n, sizeof(...), ...); + +Both forms check for overflow on the allocation size n * sizeof(...), +and return NULL if that occurred. + Chapter 15: The inline disease diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl index 0c674be0d3c6..e0aedb7a7827 100644 --- a/Documentation/DocBook/mtdnand.tmpl +++ b/Documentation/DocBook/mtdnand.tmpl @@ -1119,8 +1119,6 @@ in this page</entry> These constants are defined in nand.h. They are ored together to describe the chip functionality. <programlisting> -/* Chip can not auto increment pages */ -#define NAND_NO_AUTOINCR 0x00000001 /* Buswitdh is 16 bit */ #define NAND_BUSWIDTH_16 0x00000002 /* Device supports partial programming without padding */ diff --git a/Documentation/arm/OMAP/DSS b/Documentation/arm/OMAP/DSS index 888ae7b83ae4..a564ceea9e98 100644 --- a/Documentation/arm/OMAP/DSS +++ b/Documentation/arm/OMAP/DSS @@ -47,6 +47,51 @@ flexible way to enable non-common multi-display configuration. In addition to modelling the hardware overlays, omapdss supports virtual overlays and overlay managers. These can be used when updating a display with CPU or system DMA. +omapdss driver support for audio +-------------------------------- +There exist several display technologies and standards that support audio as +well. Hence, it is relevant to update the DSS device driver to provide an audio +interface that may be used by an audio driver or any other driver interested in +the functionality. + +The audio_enable function is intended to prepare the relevant +IP for playback (e.g., enabling an audio FIFO, taking in/out of reset +some IP, enabling companion chips, etc). It is intended to be called before +audio_start. The audio_disable function performs the reverse operation and is +intended to be called after audio_stop. + +While a given DSS device driver may support audio, it is possible that for +certain configurations audio is not supported (e.g., an HDMI display using a +VESA video timing). The audio_supported function is intended to query whether +the current configuration of the display supports audio. + +The audio_config function is intended to configure all the relevant audio +parameters of the display. In order to make the function independent of any +specific DSS device driver, a struct omap_dss_audio is defined. Its purpose +is to contain all the required parameters for audio configuration. At the +moment, such structure contains pointers to IEC-60958 channel status word +and CEA-861 audio infoframe structures. This should be enough to support +HDMI and DisplayPort, as both are based on CEA-861 and IEC-60958. + +The audio_enable/disable, audio_config and audio_supported functions could be +implemented as functions that may sleep. Hence, they should not be called +while holding a spinlock or a readlock. + +The audio_start/audio_stop function is intended to effectively start/stop audio +playback after the configuration has taken place. These functions are designed +to be used in an atomic context. Hence, audio_start should return quickly and be +called only after all the needed resources for audio playback (audio FIFOs, +DMA channels, companion chips, etc) have been enabled to begin data transfers. +audio_stop is designed to only stop the audio transfers. The resources used +for playback are released using audio_disable. + +The enum omap_dss_audio_state may be used to help the implementations of +the interface to keep track of the audio state. The initial state is _DISABLED; +then, the state transitions to _CONFIGURED, and then, when it is ready to +play audio, to _ENABLED. The state _PLAYING is used when the audio is being +rendered. + + Panel and controller drivers ---------------------------- @@ -156,6 +201,7 @@ timings Display timings (pixclock,xres/hfp/hbp/hsw,yres/vfp/vbp/vsw) "pal" and "ntsc" panel_name tear_elim Tearing elimination 0=off, 1=on +output_type Output type (video encoder only): "composite" or "svideo" There are also some debugfs files at <debugfs>/omapdss/ which show information about clocks and registers. diff --git a/Documentation/arm/SPEAr/overview.txt b/Documentation/arm/SPEAr/overview.txt index 57aae7765c74..65610bf52ebf 100644 --- a/Documentation/arm/SPEAr/overview.txt +++ b/Documentation/arm/SPEAr/overview.txt @@ -60,4 +60,4 @@ Introduction Document Author --------------- - Viresh Kumar <viresh.kumar@st.com>, (c) 2010-2012 ST Microelectronics + Viresh Kumar <viresh.linux@gmail.com>, (c) 2010-2012 ST Microelectronics diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt index 3370bc4d7b98..f5cfc62b7ad3 100644 --- a/Documentation/device-mapper/thin-provisioning.txt +++ b/Documentation/device-mapper/thin-provisioning.txt @@ -287,6 +287,17 @@ iii) Messages the current transaction id is when you change it with this compare-and-swap message. + reserve_metadata_snap + + Reserve a copy of the data mapping btree for use by userland. + This allows userland to inspect the mappings as they were when + this message was executed. Use the pool's status command to + get the root block associated with the metadata snapshot. + + release_metadata_snap + + Release a previously reserved copy of the data mapping btree. + 'thin' target ------------- diff --git a/Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt b/Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt new file mode 100644 index 000000000000..ae8af1694e95 --- /dev/null +++ b/Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt @@ -0,0 +1,93 @@ +Pinctrl-based I2C Bus Mux + +This binding describes an I2C bus multiplexer that uses pin multiplexing to +route the I2C signals, and represents the pin multiplexing configuration +using the pinctrl device tree bindings. + + +-----+ +-----+ + | dev | | dev | + +------------------------+ +-----+ +-----+ + | SoC | | | + | /----|------+--------+ + | +---+ +------+ | child bus A, on first set of pins + | |I2C|---|Pinmux| | + | +---+ +------+ | child bus B, on second set of pins + | \----|------+--------+--------+ + | | | | | + +------------------------+ +-----+ +-----+ +-----+ + | dev | | dev | | dev | + +-----+ +-----+ +-----+ + +Required properties: +- compatible: i2c-mux-pinctrl +- i2c-parent: The phandle of the I2C bus that this multiplexer's master-side + port is connected to. + +Also required are: + +* Standard pinctrl properties that specify the pin mux state for each child + bus. See ../pinctrl/pinctrl-bindings.txt. + +* Standard I2C mux properties. See mux.txt in this directory. + +* I2C child bus nodes. See mux.txt in this directory. + +For each named state defined in the pinctrl-names property, an I2C child bus +will be created. I2C child bus numbers are assigned based on the index into +the pinctrl-names property. + +The only exception is that no bus will be created for a state named "idle". If +such a state is defined, it must be the last entry in pinctrl-names. For +example: + + pinctrl-names = "ddc", "pta", "idle" -> ddc = bus 0, pta = bus 1 + pinctrl-names = "ddc", "idle", "pta" -> Invalid ("idle" not last) + pinctrl-names = "idle", "ddc", "pta" -> Invalid ("idle" not last) + +Whenever an access is made to a device on a child bus, the relevant pinctrl +state will be programmed into hardware. + +If an idle state is defined, whenever an access is not being made to a device +on a child bus, the idle pinctrl state will be programmed into hardware. + +If an idle state is not defined, the most recently used pinctrl state will be +left programmed into hardware whenever no access is being made of a device on +a child bus. + +Example: + + i2cmux { + compatible = "i2c-mux-pinctrl"; + #address-cells = <1>; + #size-cells = <0>; + + i2c-parent = <&i2c1>; + + pinctrl-names = "ddc", "pta", "idle"; + pinctrl-0 = <&state_i2cmux_ddc>; + pinctrl-1 = <&state_i2cmux_pta>; + pinctrl-2 = <&state_i2cmux_idle>; + + i2c@0 { + reg = <0>; + #address-cells = <1>; + #size-cells = <0>; + + eeprom { + compatible = "eeprom"; + reg = <0x50>; + }; + }; + + i2c@1 { + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + + eeprom { + compatible = "eeprom"; + reg = <0x50>; + }; + }; + }; + diff --git a/Documentation/devicetree/bindings/mtd/gpmi-nand.txt b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt new file mode 100644 index 000000000000..1a5bbd346d22 --- /dev/null +++ b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt @@ -0,0 +1,33 @@ +* Freescale General-Purpose Media Interface (GPMI) + +The GPMI nand controller provides an interface to control the +NAND flash chips. We support only one NAND chip now. + +Required properties: + - compatible : should be "fsl,<chip>-gpmi-nand" + - reg : should contain registers location and length for gpmi and bch. + - reg-names: Should contain the reg names "gpmi-nand" and "bch" + - interrupts : The first is the DMA interrupt number for GPMI. + The second is the BCH interrupt number. + - interrupt-names : The interrupt names "gpmi-dma", "bch"; + - fsl,gpmi-dma-channel : Should contain the dma channel it uses. + +The device tree may optionally contain sub-nodes describing partitions of the +address space. See partition.txt for more detail. + +Examples: + +gpmi-nand@8000c000 { + compatible = "fsl,imx28-gpmi-nand"; + #address-cells = <1>; + #size-cells = <1>; + reg = <0x8000c000 2000>, <0x8000a000 2000>; + reg-names = "gpmi-nand", "bch"; + interrupts = <88>, <41>; + interrupt-names = "gpmi-dma", "bch"; + fsl,gpmi-dma-channel = <4>; + + partition@0 { + ... + }; +}; diff --git a/Documentation/devicetree/bindings/mtd/mxc-nand.txt b/Documentation/devicetree/bindings/mtd/mxc-nand.txt new file mode 100644 index 000000000000..b5833d11c7be --- /dev/null +++ b/Documentation/devicetree/bindings/mtd/mxc-nand.txt @@ -0,0 +1,19 @@ +* Freescale's mxc_nand + +Required properties: +- compatible: "fsl,imxXX-nand" +- reg: address range of the nfc block +- interrupts: irq to be used +- nand-bus-width: see nand.txt +- nand-ecc-mode: see nand.txt +- nand-on-flash-bbt: see nand.txt + +Example: + + nand@d8000000 { + compatible = "fsl,imx27-nand"; + reg = <0xd8000000 0x1000>; + interrupts = <29>; + nand-bus-width = <8>; + nand-ecc-mode = "hw"; + }; diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index ebaffe208ccb..56000b33340b 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -606,3 +606,9 @@ Why: There are two mci drivers: at91-mci and atmel-mci. The PDC support Who: Ludovic Desroches <ludovic.desroches@atmel.com> ---------------------------- + +What: net/wanrouter/ +When: June 2013 +Why: Unsupported/unmaintained/unused since 2.6 + +---------------------------- diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index d449e632e6a0..8e2da1e06e3b 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -61,6 +61,7 @@ ata *); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); + void (*update_time)(struct inode *, struct timespec *, int); locking rules: all may block @@ -87,6 +88,8 @@ getxattr: no listxattr: no removexattr: yes fiemap: no +update_time: no + Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on victim. cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 912af6ce5626..fb0a6aeb936c 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -40,6 +40,7 @@ Table of Contents 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings 3.5 /proc/<pid>/mountinfo - Information about mounts 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm + 3.7 /proc/<pid>/task/<tid>/children - Information about task children 4 Configuring procfs 4.1 Mount options @@ -310,6 +311,11 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) start_data address above which program data+bss is placed end_data address below which program data+bss is placed start_brk address above which program heap can be expanded with brk() + arg_start address above which program command line is placed + arg_end address below which program command line is placed + env_start address above which program environment is placed + env_end address below which program environment is placed + exit_code the thread's exit_code in the form reported by the waitpid system call .............................................................................. The /proc/PID/maps file containing the currently mapped memory regions and @@ -1578,6 +1584,23 @@ then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated comm value. +3.7 /proc/<pid>/task/<tid>/children - Information about task children +------------------------------------------------------------------------- +This file provides a fast way to retrieve first level children pids +of a task pointed by <pid>/<tid> pair. The format is a space separated +stream of pids. + +Note the "first level" here -- if a child has own children they will +not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children +to obtain the descendants. + +Since this interface is intended to be fast and cheap it doesn't +guarantee to provide precise results and some children might be +skipped, especially if they've exited right after we printed their +pids, so one need to either stop or freeze processes being inspected +if precise results are needed. + + ------------------------------------------------------------------------------ Configuring procfs ------------------------------------------------------------------------------ diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index ef19f91a0f12..efd23f481704 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -363,6 +363,7 @@ struct inode_operations { ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); + void (*update_time)(struct inode *, struct timespec *, int); }; Again, all methods are called without any locks being held, unless @@ -471,6 +472,9 @@ otherwise noted. removexattr: called by the VFS to remove an extended attribute from a file. This method is called by removexattr(2) system call. + update_time: called by the VFS to update a specific time or the i_version of + an inode. If this is not defined the VFS will update the inode itself + and call mark_inode_dirty_sync. The Address Space Object ======================== diff --git a/Documentation/hwmon/coretemp b/Documentation/hwmon/coretemp index 84d46c0c71a3..c86b50c03ea8 100644 --- a/Documentation/hwmon/coretemp +++ b/Documentation/hwmon/coretemp @@ -6,7 +6,9 @@ Supported chips: Prefix: 'coretemp' CPUID: family 0x6, models 0xe (Pentium M DC), 0xf (Core 2 DC 65nm), 0x16 (Core 2 SC 65nm), 0x17 (Penryn 45nm), - 0x1a (Nehalem), 0x1c (Atom), 0x1e (Lynnfield) + 0x1a (Nehalem), 0x1c (Atom), 0x1e (Lynnfield), + 0x26 (Tunnel Creek Atom), 0x27 (Medfield Atom), + 0x36 (Cedar Trail Atom) Datasheet: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide http://softwarecommunity.intel.com/Wiki/Mobility/720.htm @@ -52,6 +54,17 @@ Some information comes from ark.intel.com Process Processor TjMax(C) +22nm Core i5/i7 Processors + i7 3920XM, 3820QM, 3720QM, 3667U, 3520M 105 + i5 3427U, 3360M/3320M 105 + i7 3770/3770K 105 + i5 3570/3570K, 3550, 3470/3450 105 + i7 3770S 103 + i5 3570S/3550S, 3475S/3470S/3450S 103 + i7 3770T 94 + i5 3570T 94 + i5 3470T 91 + 32nm Core i3/i5/i7 Processors i7 660UM/640/620, 640LM/620, 620M, 610E 105 i5 540UM/520/430, 540M/520/450/430 105 @@ -65,6 +78,11 @@ Process Processor TjMax(C) U3400 105 P4505/P4500 90 +32nm Atom Processors + Z2460 90 + D2700/2550/2500 100 + N2850/2800/2650/2600 100 + 45nm Xeon Processors 5400 Quad-Core X5492, X5482, X5472, X5470, X5460, X5450 85 E5472, E5462, E5450/40/30/20/10/05 85 @@ -85,6 +103,8 @@ Process Processor TjMax(C) N475/470/455/450 100 N280/270 90 330/230 125 + E680/660/640/620 90 + E680T/660T/640T/620T 110 45nm Core2 Processors Solo ULV SU3500/3300 100 diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index c45513d806ab..a92c5ebf373e 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2543,6 +2543,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. sched_debug [KNL] Enables verbose scheduler debug messages. + skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate + xtime_lock contention on larger systems, and/or RCU lock + contention on all systems with CONFIG_MAXSMP set. + Format: { "0" | "1" } + 0 -- disable. (may be 1 via CONFIG_CMDLINE="skew_tick=1" + 1 -- enable. + Note: increases power consumption, thus should only be + enabled if running jitter sensitive (HPC/RT) workloads. + security= [SECURITY] Choose a security module to enable at boot. If this boot parameter is not specified, only the first security module asking for security registration will be diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index ab1e8d7004c5..5cb9a1972460 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -10,8 +10,8 @@ Currently this network device driver is for all STM embedded MAC/GMAC (i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 FF1152AMT0221 D1215994A VIRTEX FPGA board. -DWC Ether MAC 10/100/1000 Universal version 3.60a (and older) and DWC Ether MAC 10/100 -Universal version 4.0 have been used for developing this driver. +DWC Ether MAC 10/100/1000 Universal version 3.60a (and older) and DWC Ether +MAC 10/100 Universal version 4.0 have been used for developing this driver. This driver supports both the platform bus and PCI. @@ -54,27 +54,27 @@ net_device structure enabling the scatter/gather feature. When one or more packets are received, an interrupt happens. The interrupts are not queued so the driver has to scan all the descriptors in the ring during the receive process. -This is based on NAPI so the interrupt handler signals only if there is work to be -done, and it exits. +This is based on NAPI so the interrupt handler signals only if there is work +to be done, and it exits. Then the poll method will be scheduled at some future point. The incoming packets are stored, by the DMA, in a list of pre-allocated socket buffers in order to avoid the memcpy (Zero-copy). 4.3) Timer-Driver Interrupt -Instead of having the device that asynchronously notifies the frame receptions, the -driver configures a timer to generate an interrupt at regular intervals. -Based on the granularity of the timer, the frames that are received by the device -will experience different levels of latency. Some NICs have dedicated timer -device to perform this task. STMMAC can use either the RTC device or the TMU -channel 2 on STLinux platforms. +Instead of having the device that asynchronously notifies the frame receptions, +the driver configures a timer to generate an interrupt at regular intervals. +Based on the granularity of the timer, the frames that are received by the +device will experience different levels of latency. Some NICs have dedicated +timer device to perform this task. STMMAC can use either the RTC device or the +TMU channel 2 on STLinux platforms. The timers frequency can be passed to the driver as parameter; when change it, take care of both hardware capability and network stability/performance impact. -Several performance tests on STM platforms showed this optimisation allows to spare -the CPU while having the maximum throughput. +Several performance tests on STM platforms showed this optimisation allows to +spare the CPU while having the maximum throughput. 4.4) WOL -Wake up on Lan feature through Magic and Unicast frames are supported for the GMAC -core. +Wake up on Lan feature through Magic and Unicast frames are supported for the +GMAC core. 4.5) DMA descriptors Driver handles both normal and enhanced descriptors. The latter has been only @@ -106,7 +106,8 @@ Several driver's information can be passed through the platform These are included in the include/linux/stmmac.h header file and detailed below as well: - struct plat_stmmacenet_data { +struct plat_stmmacenet_data { + char *phy_bus_name; int bus_id; int phy_addr; int interface; @@ -124,19 +125,24 @@ and detailed below as well: void (*bus_setup)(void __iomem *ioaddr); int (*init)(struct platform_device *pdev); void (*exit)(struct platform_device *pdev); + void *custom_cfg; + void *custom_data; void *bsp_priv; }; Where: + o phy_bus_name: phy bus name to attach to the stmmac. o bus_id: bus identifier. o phy_addr: the physical address can be passed from the platform. If it is set to -1 the driver will automatically detect it at run-time by probing all the 32 addresses. o interface: PHY device's interface. o mdio_bus_data: specific platform fields for the MDIO bus. - o pbl: the Programmable Burst Length is maximum number of beats to + o dma_cfg: internal DMA parameters + o pbl: the Programmable Burst Length is maximum number of beats to be transferred in one DMA transaction. GMAC also enables the 4xPBL by default. + o fixed_burst/mixed_burst/burst_len o clk_csr: fixed CSR Clock range selection. o has_gmac: uses the GMAC core. o enh_desc: if sets the MAC will use the enhanced descriptor structure. @@ -160,8 +166,9 @@ Where: this is sometime necessary on some platforms (e.g. ST boxes) where the HW needs to have set some PIO lines or system cfg registers. - o custom_cfg: this is a custom configuration that can be passed while - initialising the resources. + o custom_cfg/custom_data: this is a custom configuration that can be passed + while initialising the resources. + o bsp_priv: another private poiter. For MDIO bus The we have: @@ -180,7 +187,6 @@ Where: o irqs: list of IRQs, one per PHY. o probed_phy_irq: if irqs is NULL, use this for probed PHY. - For DMA engine we have the following internal fields that should be tuned according to the HW capabilities. diff --git a/Documentation/power/charger-manager.txt b/Documentation/power/charger-manager.txt index fdcca991df30..b4f7f4b23f64 100644 --- a/Documentation/power/charger-manager.txt +++ b/Documentation/power/charger-manager.txt @@ -44,6 +44,16 @@ Charger Manager supports the following: Normally, the platform will need to resume and suspend some devices that are used by Charger Manager. +* Support for premature full-battery event handling + If the battery voltage drops by "fullbatt_vchkdrop_uV" after + "fullbatt_vchkdrop_ms" from the full-battery event, the framework + restarts charging. This check is also performed while suspended by + setting wakeup time accordingly and using suspend_again. + +* Support for uevent-notify + With the charger-related events, the device sends + notification to users with UEVENT. + 2. Global Charger-Manager Data related with suspend_again ======================================================== In order to setup Charger Manager with suspend-again feature @@ -55,7 +65,7 @@ if there are multiple batteries. If there are multiple batteries, the multiple instances of Charger Manager share the same charger_global_desc and it will manage in-suspend monitoring for all instances of Charger Manager. -The user needs to provide all the two entries properly in order to activate +The user needs to provide all the three entries properly in order to activate in-suspend monitoring: struct charger_global_desc { @@ -74,6 +84,11 @@ bool (*rtc_only_wakeup)(void); same struct. If there is any other wakeup source triggered the wakeup, it should return false. If the "rtc" is the only wakeup reason, it should return true. + +bool assume_timer_stops_in_suspend; + : if true, Charger Manager assumes that + the timer (CM uses jiffies as timer) stops during suspend. Then, CM + assumes that the suspend-duration is same as the alarm length. }; 3. How to setup suspend_again @@ -111,6 +126,16 @@ enum polling_modes polling_mode; CM_POLL_CHARGING_ONLY: poll this battery if and only if the battery is being charged. +unsigned int fullbatt_vchkdrop_ms; +unsigned int fullbatt_vchkdrop_uV; + : If both have non-zero values, Charger Manager will check the + battery voltage drop fullbatt_vchkdrop_ms after the battery is fully + charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger + Manager will try to recharge the battery by disabling and enabling + chargers. Recharge with voltage drop condition only (without delay + condition) is needed to be implemented with hardware interrupts from + fuel gauges or charger devices/chips. + unsigned int fullbatt_uV; : If specified with a non-zero value, Charger Manager assumes that the battery is full (capacity = 100) if the battery is not being @@ -122,6 +147,8 @@ unsigned int polling_interval_ms; this battery every polling_interval_ms or more frequently. enum data_source battery_present; + : CM_BATTERY_PRESENT: assume that the battery exists. + CM_NO_BATTERY: assume that the battery does not exists. CM_FUEL_GAUGE: get battery presence information from fuel gauge. CM_CHARGER_STAT: get battery presence from chargers. @@ -151,7 +178,17 @@ bool measure_battery_temp; the value of measure_battery_temp. }; -5. Other Considerations +5. Notify Charger-Manager of charger events: cm_notify_event() +========================================================= +If there is an charger event is required to notify +Charger Manager, a charger device driver that triggers the event can call +cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. +In the function, psy is the charger driver's power_supply pointer, which is +associated with Charger-Manager. The parameter "type" +is the same as irq's type (enum cm_event_types). The event message "msg" is +optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS". + +6. Other Considerations ======================= At the charger/battery-related events such as battery-pulled-out, diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt index 9f16c5178b66..211831d4095f 100644 --- a/Documentation/power/power_supply_class.txt +++ b/Documentation/power/power_supply_class.txt @@ -84,6 +84,8 @@ are already charged or discharging, 'n/a' can be displayed (or HEALTH - represents health of the battery, values corresponds to POWER_SUPPLY_HEALTH_*, defined in battery.h. +VOLTAGE_OCV - open circuit voltage of the battery. + VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and minimal power supply voltages. Maximal/minimal means values of voltages when battery considered "full"/"empty" at normal conditions. Yes, there is diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index 88fd7f5c8dcd..13d6166d7a27 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -225,6 +225,13 @@ a queue must be less or equal then msg_max. maximum message size value (it is every message queue's attribute set during its creation). +/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the +default number of messages in a queue value if attr parameter of mq_open(2) is +NULL. If it exceed msg_max, the default value is initialized msg_max. + +/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting +the default message size value if attr parameter of mq_open(2) is NULL. If it +exceed msgsize_max, the default value is initialized msgsize_max. 4. /proc/sys/fs/epoll - Configuration options for the epoll interface -------------------------------------------------------- diff --git a/Documentation/vm/frontswap.txt b/Documentation/vm/frontswap.txt new file mode 100644 index 000000000000..37067cf455f4 --- /dev/null +++ b/Documentation/vm/frontswap.txt @@ -0,0 +1,278 @@ +Frontswap provides a "transcendent memory" interface for swap pages. +In some environments, dramatic performance savings may be obtained because +swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. + +(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" +and the only necessary changes to the core kernel for transcendent memory; +all other supporting code -- the "backends" -- is implemented as drivers. +See the LWN.net article "Transcendent memory in a nutshell" for a detailed +overview of frontswap and related kernel parts: +https://lwn.net/Articles/454795/ ) + +Frontswap is so named because it can be thought of as the opposite of +a "backing" store for a swap device. The storage is assumed to be +a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming +to the requirements of transcendent memory (such as Xen's "tmem", or +in-kernel compressed memory, aka "zcache", or future RAM-like devices); +this pseudo-RAM device is not directly accessible or addressable by the +kernel and is of unknown and possibly time-varying size. The driver +links itself to frontswap by calling frontswap_register_ops to set the +frontswap_ops funcs appropriately and the functions it provides must +conform to certain policies as follows: + +An "init" prepares the device to receive frontswap pages associated +with the specified swap device number (aka "type"). A "store" will +copy the page to transcendent memory and associate it with the type and +offset associated with the page. A "load" will copy the page, if found, +from transcendent memory into kernel memory, but will NOT remove the page +from from transcendent memory. An "invalidate_page" will remove the page +from transcendent memory and an "invalidate_area" will remove ALL pages +associated with the swap type (e.g., like swapoff) and notify the "device" +to refuse further stores with that swap type. + +Once a page is successfully stored, a matching load on the page will normally +succeed. So when the kernel finds itself in a situation where it needs +to swap out a page, it first attempts to use frontswap. If the store returns +success, the data has been successfully saved to transcendent memory and +a disk write and, if the data is later read back, a disk read are avoided. +If a store returns failure, transcendent memory has rejected the data, and the +page can be written to swap as usual. + +If a backend chooses, frontswap can be configured as a "writethrough +cache" by calling frontswap_writethrough(). In this mode, the reduction +in swap device writes is lost (and also a non-trivial performance advantage) +in order to allow the backend to arbitrarily "reclaim" space used to +store frontswap pages to more completely manage its memory usage. + +Note that if a page is stored and the page already exists in transcendent memory +(a "duplicate" store), either the store succeeds and the data is overwritten, +or the store fails AND the page is invalidated. This ensures stale data may +never be obtained from frontswap. + +If properly configured, monitoring of frontswap is done via debugfs in +the /sys/kernel/debug/frontswap directory. The effectiveness of +frontswap can be measured (across all swap devices) with: + +failed_stores - how many store attempts have failed +loads - how many loads were attempted (all should succeed) +succ_stores - how many store attempts have succeeded +invalidates - how many invalidates were attempted + +A backend implementation may provide additional metrics. + +FAQ + +1) Where's the value? + +When a workload starts swapping, performance falls through the floor. +Frontswap significantly increases performance in many such workloads by +providing a clean, dynamic interface to read and write swap pages to +"transcendent memory" that is otherwise not directly addressable to the kernel. +This interface is ideal when data is transformed to a different form +and size (such as with compression) or secretly moved (as might be +useful for write-balancing for some RAM-like devices). Swap pages (and +evicted page-cache pages) are a great use for this kind of slower-than-RAM- +but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and +cleancache) interface to transcendent memory provides a nice way to read +and write -- and indirectly "name" -- the pages. + +Frontswap -- and cleancache -- with a fairly small impact on the kernel, +provides a huge amount of flexibility for more dynamic, flexible RAM +utilization in various system configurations: + +In the single kernel case, aka "zcache", pages are compressed and +stored in local memory, thus increasing the total anonymous pages +that can be safely kept in RAM. Zcache essentially trades off CPU +cycles used in compression/decompression for better memory utilization. +Benchmarks have shown little or no impact when memory pressure is +low while providing a significant performance improvement (25%+) +on some workloads under high memory pressure. + +"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory +support for clustered systems. Frontswap pages are locally compressed +as in zcache, but then "remotified" to another system's RAM. This +allows RAM to be dynamically load-balanced back-and-forth as needed, +i.e. when system A is overcommitted, it can swap to system B, and +vice versa. RAMster can also be configured as a memory server so +many servers in a cluster can swap, dynamically as needed, to a single +server configured with a large amount of RAM... without pre-configuring +how much of the RAM is available for each of the clients! + +In the virtual case, the whole point of virtualization is to statistically +multiplex physical resources acrosst the varying demands of multiple +virtual machines. This is really hard to do with RAM and efforts to do +it well with no kernel changes have essentially failed (except in some +well-publicized special-case workloads). +Specifically, the Xen Transcendent Memory backend allows otherwise +"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple +virtual machines, but the pages can be compressed and deduplicated to +optimize RAM utilization. And when guest OS's are induced to surrender +underutilized RAM (e.g. with "selfballooning"), sudden unexpected +memory pressure may result in swapping; frontswap allows those pages +to be swapped to and from hypervisor RAM (if overall host system memory +conditions allow), thus mitigating the potentially awful performance impact +of unplanned swapping. + +A KVM implementation is underway and has been RFC'ed to lkml. And, +using frontswap, investigation is also underway on the use of NVM as +a memory extension technology. + +2) Sure there may be performance advantages in some situations, but + what's the space/time overhead of frontswap? + +If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into +nothingness and the only overhead is a few extra bytes per swapon'ed +swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" +registers, there is one extra global variable compared to zero for +every swap page read or written. If CONFIG_FRONTSWAP is enabled +AND a frontswap backend registers AND the backend fails every "store" +request (i.e. provides no memory despite claiming it might), +CPU overhead is still negligible -- and since every frontswap fail +precedes a swap page write-to-disk, the system is highly likely +to be I/O bound and using a small fraction of a percent of a CPU +will be irrelevant anyway. + +As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend +registers, one bit is allocated for every swap page for every swap +device that is swapon'd. This is added to the EIGHT bits (which +was sixteen until about 2.6.34) that the kernel already allocates +for every swap page for every swap device that is swapon'd. (Hugh +Dickins has observed that frontswap could probably steal one of +the existing eight bits, but let's worry about that minor optimization +later.) For very large swap disks (which are rare) on a standard +4K pagesize, this is 1MB per 32GB swap. + +When swap pages are stored in transcendent memory instead of written +out to disk, there is a side effect that this may create more memory +pressure that can potentially outweigh the other advantages. A +backend, such as zcache, must implement policies to carefully (but +dynamically) manage memory limits to ensure this doesn't happen. + +3) OK, how about a quick overview of what this frontswap patch does + in terms that a kernel hacker can grok? + +Let's assume that a frontswap "backend" has registered during +kernel initialization; this registration indicates that this +frontswap backend has access to some "memory" that is not directly +accessible by the kernel. Exactly how much memory it provides is +entirely dynamic and random. + +Whenever a swap-device is swapon'd frontswap_init() is called, +passing the swap device number (aka "type") as a parameter. +This notifies frontswap to expect attempts to "store" swap pages +associated with that number. + +Whenever the swap subsystem is readying a page to write to a swap +device (c.f swap_writepage()), frontswap_store is called. Frontswap +consults with the frontswap backend and if the backend says it does NOT +have room, frontswap_store returns -1 and the kernel swaps the page +to the swap device as normal. Note that the response from the frontswap +backend is unpredictable to the kernel; it may choose to never accept a +page, it could accept every ninth page, or it might accept every +page. But if the backend does accept a page, the data from the page +has already been copied and associated with the type and offset, +and the backend guarantees the persistence of the data. In this case, +frontswap sets a bit in the "frontswap_map" for the swap device +corresponding to the page offset on the swap device to which it would +otherwise have written the data. + +When the swap subsystem needs to swap-in a page (swap_readpage()), +it first calls frontswap_load() which checks the frontswap_map to +see if the page was earlier accepted by the frontswap backend. If +it was, the page of data is filled from the frontswap backend and +the swap-in is complete. If not, the normal swap-in code is +executed to obtain the page of data from the real swap device. + +So every time the frontswap backend accepts a page, a swap device read +and (potentially) a swap device write are replaced by a "frontswap backend +store" and (possibly) a "frontswap backend loads", which are presumably much +faster. + +4) Can't frontswap be configured as a "special" swap device that is + just higher priority than any real swap device (e.g. like zswap, + or maybe swap-over-nbd/NFS)? + +No. First, the existing swap subsystem doesn't allow for any kind of +swap hierarchy. Perhaps it could be rewritten to accomodate a hierarchy, +but this would require fairly drastic changes. Even if it were +rewritten, the existing swap subsystem uses the block I/O layer which +assumes a swap device is fixed size and any page in it is linearly +addressable. Frontswap barely touches the existing swap subsystem, +and works around the constraints of the block I/O subsystem to provide +a great deal of flexibility and dynamicity. + +For example, the acceptance of any swap page by the frontswap backend is +entirely unpredictable. This is critical to the definition of frontswap +backends because it grants completely dynamic discretion to the +backend. In zcache, one cannot know a priori how compressible a page is. +"Poorly" compressible pages can be rejected, and "poorly" can itself be +defined dynamically depending on current memory constraints. + +Further, frontswap is entirely synchronous whereas a real swap +device is, by definition, asynchronous and uses block I/O. The +block I/O layer is not only unnecessary, but may perform "optimizations" +that are inappropriate for a RAM-oriented device including delaying +the write of some pages for a significant amount of time. Synchrony is +required to ensure the dynamicity of the backend and to avoid thorny race +conditions that would unnecessarily and greatly complicate frontswap +and/or the block I/O subsystem. That said, only the initial "store" +and "load" operations need be synchronous. A separate asynchronous thread +is free to manipulate the pages stored by frontswap. For example, +the "remotification" thread in RAMster uses standard asynchronous +kernel sockets to move compressed frontswap pages to a remote machine. +Similarly, a KVM guest-side implementation could do in-guest compression +and use "batched" hypercalls. + +In a virtualized environment, the dynamicity allows the hypervisor +(or host OS) to do "intelligent overcommit". For example, it can +choose to accept pages only until host-swapping might be imminent, +then force guests to do their own swapping. + +There is a downside to the transcendent memory specifications for +frontswap: Since any "store" might fail, there must always be a real +slot on a real swap device to swap the page. Thus frontswap must be +implemented as a "shadow" to every swapon'd device with the potential +capability of holding every page that the swap device might have held +and the possibility that it might hold no pages at all. This means +that frontswap cannot contain more pages than the total of swapon'd +swap devices. For example, if NO swap device is configured on some +installation, frontswap is useless. Swapless portable devices +can still use frontswap but a backend for such devices must configure +some kind of "ghost" swap device and ensure that it is never used. + +5) Why this weird definition about "duplicate stores"? If a page + has been previously successfully stored, can't it always be + successfully overwritten? + +Nearly always it can, but no, sometimes it cannot. Consider an example +where data is compressed and the original 4K page has been compressed +to 1K. Now an attempt is made to overwrite the page with data that +is non-compressible and so would take the entire 4K. But the backend +has no more space. In this case, the store must be rejected. Whenever +frontswap rejects a store that would overwrite, it also must invalidate +the old data and ensure that it is no longer accessible. Since the +swap subsystem then writes the new data to the read swap device, +this is the correct course of action to ensure coherency. + +6) What is frontswap_shrink for? + +When the (non-frontswap) swap subsystem swaps out a page to a real +swap device, that page is only taking up low-value pre-allocated disk +space. But if frontswap has placed a page in transcendent memory, that +page may be taking up valuable real estate. The frontswap_shrink +routine allows code outside of the swap subsystem to force pages out +of the memory managed by frontswap and back into kernel-addressable memory. +For example, in RAMster, a "suction driver" thread will attempt +to "repatriate" pages sent to a remote machine back to the local machine; +this is driven using the frontswap_shrink mechanism when memory pressure +subsides. + +7) Why does the frontswap patch create the new include file swapfile.h? + +The frontswap code depends on some swap-subsystem-internal data +structures that have, over the years, moved back and forth between +static and global. This seemed a reasonable compromise: Define +them as global but declare them in a new include file that isn't +included by the large number of source files that include swap.h. + +Dan Magenheimer, last updated April 9, 2012 diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 4600cbe3d6be..7587493c67f1 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -16,7 +16,7 @@ There are three components to pagemap: * Bits 0-4 swap type if swapped * Bits 5-54 swap offset if swapped * Bits 55-60 page shift (page size = 1<<page shift) - * Bit 61 reserved for future use + * Bit 61 page is file-page or shared-anon * Bit 62 page swapped * Bit 63 page present diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt index 6752870c4970..b0c6d1bbb434 100644 --- a/Documentation/vm/slub.txt +++ b/Documentation/vm/slub.txt @@ -17,7 +17,7 @@ data and perform operation on the slabs. By default slabinfo only lists slabs that have data in them. See "slabinfo -h" for more options when running the command. slabinfo can be compiled with -gcc -o slabinfo tools/slub/slabinfo.c +gcc -o slabinfo tools/vm/slabinfo.c Some of the modes of operation of slabinfo require that slub debugging be enabled on the command line. F.e. no tracking information will be diff --git a/Documentation/x86/efi-stub.txt b/Documentation/x86/efi-stub.txt new file mode 100644 index 000000000000..44e6bb6ead10 --- /dev/null +++ b/Documentation/x86/efi-stub.txt @@ -0,0 +1,65 @@ + The EFI Boot Stub + --------------------------- + +On the x86 platform, a bzImage can masquerade as a PE/COFF image, +thereby convincing EFI firmware loaders to load it as an EFI +executable. The code that modifies the bzImage header, along with the +EFI-specific entry point that the firmware loader jumps to are +collectively known as the "EFI boot stub", and live in +arch/x86/boot/header.S and arch/x86/boot/compressed/eboot.c, +respectively. + +By using the EFI boot stub it's possible to boot a Linux kernel +without the use of a conventional EFI boot loader, such as grub or +elilo. Since the EFI boot stub performs the jobs of a boot loader, in +a certain sense it *IS* the boot loader. + +The EFI boot stub is enabled with the CONFIG_EFI_STUB kernel option. + + +**** How to install bzImage.efi + +The bzImage located in arch/x86/boot/bzImage must be copied to the EFI +System Partiion (ESP) and renamed with the extension ".efi". Without +the extension the EFI firmware loader will refuse to execute it. It's +not possible to execute bzImage.efi from the usual Linux file systems +because EFI firmware doesn't have support for them. + + +**** Passing kernel parameters from the EFI shell + +Arguments to the kernel can be passed after bzImage.efi, e.g. + + fs0:> bzImage.efi console=ttyS0 root=/dev/sda4 + + +**** The "initrd=" option + +Like most boot loaders, the EFI stub allows the user to specify +multiple initrd files using the "initrd=" option. This is the only EFI +stub-specific command line parameter, everything else is passed to the +kernel when it boots. + +The path to the initrd file must be an absolute path from the +beginning of the ESP, relative path names do not work. Also, the path +is an EFI-style path and directory elements must be separated with +backslashes (\). For example, given the following directory layout, + +fs0:> + Kernels\ + bzImage.efi + initrd-large.img + + Ramdisks\ + initrd-small.img + initrd-medium.img + +to boot with the initrd-large.img file if the current working +directory is fs0:\Kernels, the following command must be used, + + fs0:\Kernels> bzImage.efi initrd=\Kernels\initrd-large.img + +Notice how bzImage.efi can be specified with a relative path. That's +because the image we're executing is interpreted by the EFI shell, +which understands relative paths, whereas the rest of the command line +is passed to bzImage.efi. |
