This is the current state of the patch series for people to comment on. I am using the Open Firmware naming scheme to specify device path names. In this submission I addressed all comments from the previous one, added option rom support and rebased to qemu upstream.

Kevin, can you double check that the names are usable by Seabios? Reading the PC boot specification, it looks like Seabios will not be able to take full advantage of this though. Only one BCV can be bootable, so only the disk with the lowest boot index will be bootable by Seabios. Is this correct?

Names look like this on a pci machine:

/pci@i0cf8/ide@1,1/drive@1/disk@0
/pci@i0cf8/isa@1/fdc@03f1/floppy@1
/pci@i0cf8/isa@1/fdc@03f1/floppy@0
/pci@i0cf8/ide@1,1/drive@1/disk@1
/pci@i0cf8/ide@1,1/drive@0/disk@0
/pci@i0cf8/scsi@3/disk@0
/pci@i0cf8/ethernet@4/ethernet-phy@0
/pci@i0cf8/ethernet@5/ethernet-phy@0
/pci@i0cf8/ide@1,1/drive@0/disk@1
/pci@i0cf8/isa@1/ide@01e8/drive@0/disk@0
/pci@i0cf8/usb@1,2/network@0/ethernet@0
/pci@i0cf8/usb@1,2/hub@1/network@0/ethernet@0
/rom@genroms/linuxboot.bin

and on an isa machine:

/isa/ide@0170/drive@0/disk@0
/isa/fdc@03f1/floppy@1
/isa/fdc@03f1/floppy@0
/isa/ide@0170/drive@0/disk@1

Instead of using the get_dev_path() callback I introduced another one, get_fw_dev_path. Unfortunately the way the get_dev_path() callback is used in migration code makes it hard to reuse it for other purposes. First of all, it is not called recursively, so the caller expects it to provide a unique name by itself. A device path, though, is inherently recursive: each individual element may not be unique, but the whole path will be. On the other hand, to call get_dev_path() recursively in migration code we would have to implement it for all possible buses first. The other problem is compatibility. If we change the get_dev_path() output format now, we will not be able to migrate from old qemu to new one without some additional compatibility layer.

Gleb Natapov (15):
  Introduce fw_name field to DeviceInfo structure.
  Introduce new BusInfo callback ...
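The recursive construction described in the cover letter can be illustrated with a minimal sketch. It is not the code from the series: the Dev structure and build_fw_path() are hypothetical stand-ins for the qdev types and the get_fw_dev_path callback, and they only show how non-unique elements combine into a unique path when each node first asks its parent for its prefix.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified device node; not QEMU's DeviceState. */
typedef struct Dev {
    const struct Dev *parent;  /* NULL for a top-level device */
    const char *fw_name;       /* e.g. "pci", "ide", "disk" */
    const char *unit_addr;     /* e.g. "i0cf8", "1,1", "0"; NULL if none */
} Dev;

/*
 * Append the path of d to buf, which must already hold a NUL-terminated
 * string (normally ""). Returns the resulting length, or -1 if the path
 * did not fit into size bytes.
 */
static int build_fw_path(const Dev *d, char *buf, size_t size)
{
    size_t len;
    int n;

    /* Emit all ancestors first, so the path reads root to leaf. */
    if (d->parent && build_fw_path(d->parent, buf, size) < 0) {
        return -1;
    }
    len = strlen(buf);
    if (d->unit_addr) {
        n = snprintf(buf + len, size - len, "/%s@%s",
                     d->fw_name, d->unit_addr);
    } else {
        n = snprintf(buf + len, size - len, "/%s", d->fw_name);
    }
    if (n < 0 || (size_t)n >= size - len) {
        return -1;  /* element was truncated */
    }
    return (int)(len + n);
}
```

With a four-level chain pci@i0cf8, ide@1,1, drive@1, disk@0, calling build_fw_path() on the disk node yields the first path in the list above.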
Actions that depend on a fully initialized device model should register
with this notifier chain.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
sysemu.h | 2 ++
vl.c | 15 +++++++++++++++
2 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/sysemu.h b/sysemu.h
index 48f8eee..c42f33a 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -60,6 +60,8 @@ void qemu_system_reset(void);
void qemu_add_exit_notifier(Notifier *notify);
void qemu_remove_exit_notifier(Notifier *notify);
+void qemu_add_machine_init_done_notifier(Notifier *notify);
+
void do_savevm(Monitor *mon, const QDict *qdict);
int load_vmstate(const char *name);
void do_delvm(Monitor *mon, const QDict *qdict);
diff --git a/vl.c b/vl.c
index e8ada75..918d988 100644
--- a/vl.c
+++ b/vl.c
@@ -253,6 +253,9 @@ static void *boot_set_opaque;
static NotifierList exit_notifiers =
NOTIFIER_LIST_INITIALIZER(exit_notifiers);
+static NotifierList machine_init_done_notifiers =
+ NOTIFIER_LIST_INITIALIZER(machine_init_done_notifiers);
+
int kvm_allowed = 0;
uint32_t xen_domid;
enum xen_mode xen_mode = XEN_EMULATE;
@@ -1778,6 +1781,16 @@ static void qemu_run_exit_notifiers(void)
notifier_list_notify(&exit_notifiers);
}
+void qemu_add_machine_init_done_notifier(Notifier *notify)
+{
+ notifier_list_add(&machine_init_done_notifiers, notify);
+}
+
+static void qemu_run_machine_init_done_notifiers(void)
+{
+ notifier_list_notify(&machine_init_done_notifiers);
+}
+
static const QEMUOption *lookup_opt(int argc, char **argv,
const char **poptarg, int *poptind)
{
@@ -3023,6 +3036,8 @@ int main(int argc, char **argv, char **envp)
exit(1);
}
+ qemu_run_machine_init_done_notifiers();
+
qemu_system_reset();
if (loadvm) {
if (load_vmstate(loadvm) < 0) {
--
1.7.1
--
Change fw_cfg_add_file() to take the full file path as a parameter instead
of building one internally. There are two reasons for that. First, the
caller may need to know how the file is named. Second, this moves the
policy of file naming out of fw_cfg. A platform may want to use more than
two levels of directories, for instance.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
hw/fw_cfg.c | 16 ++++------------
hw/fw_cfg.h | 4 ++--
hw/loader.c | 16 ++++++++++++++--
3 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c
index 72866ae..7b9434f 100644
--- a/hw/fw_cfg.c
+++ b/hw/fw_cfg.c
@@ -277,10 +277,9 @@ int fw_cfg_add_callback(FWCfgState *s, uint16_t key, FWCfgCallback callback,
return 1;
}
-int fw_cfg_add_file(FWCfgState *s, const char *dir, const char *filename,
- uint8_t *data, uint32_t len)
+int fw_cfg_add_file(FWCfgState *s, const char *filename, uint8_t *data,
+ uint32_t len)
{
- const char *basename;
int i, index;
if (!s->files) {
@@ -297,15 +296,8 @@ int fw_cfg_add_file(FWCfgState *s, const char *dir, const char *filename,
fw_cfg_add_bytes(s, FW_CFG_FILE_FIRST + index, data, len);
- basename = strrchr(filename, '/');
- if (basename) {
- basename++;
- } else {
- basename = filename;
- }
-
- snprintf(s->files->f[index].name, sizeof(s->files->f[index].name),
- "%s/%s", dir, basename);
+ pstrcpy(s->files->f[index].name, sizeof(s->files->f[index].name),
+ filename);
for (i = 0; i < index; i++) {
if (strcmp(s->files->f[index].name, s->files->f[i].name) == 0) {
FW_CFG_DPRINTF("%s: skip duplicate: %s\n", __FUNCTION__,
diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h
index 4d13a4f..856bf91 100644
--- a/hw/fw_cfg.h
+++ b/hw/fw_cfg.h
@@ -60,8 +60,8 @@ int fw_cfg_add_i32(FWCfgState *s, uint16_t key, uint32_t value);
int fw_cfg_add_i64(FWCfgState *s, uint16_t key, uint64_t value);
int ...
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
hw/fw_cfg.c | 14 ++++++++++++++
hw/fw_cfg.h | 4 +++-
sysemu.h | 1 +
vl.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 69 insertions(+), 1 deletions(-)
diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c
index 7b9434f..f6a67db 100644
--- a/hw/fw_cfg.c
+++ b/hw/fw_cfg.c
@@ -53,6 +53,7 @@ struct FWCfgState {
FWCfgFiles *files;
uint16_t cur_entry;
uint32_t cur_offset;
+ Notifier machine_ready;
};
static void fw_cfg_write(FWCfgState *s, uint8_t value)
@@ -315,6 +316,15 @@ int fw_cfg_add_file(FWCfgState *s, const char *filename, uint8_t *data,
return 1;
}
+static void fw_cfg_machine_ready(struct Notifier* n)
+{
+ uint32_t len;
+ char *bootindex = get_boot_devices_list(&len);
+
+ fw_cfg_add_bytes(container_of(n, FWCfgState, machine_ready),
+ FW_CFG_BOOTINDEX, (uint8_t*)bootindex, len);
+}
+
FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
target_phys_addr_t ctl_addr, target_phys_addr_t data_addr)
{
@@ -343,6 +353,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port,
fw_cfg_add_i16(s, FW_CFG_MAX_CPUS, (uint16_t)max_cpus);
fw_cfg_add_i16(s, FW_CFG_BOOT_MENU, (uint16_t)boot_menu);
+
+ s->machine_ready.notify = fw_cfg_machine_ready;
+ qemu_add_machine_init_done_notifier(&s->machine_ready);
+
return s;
}
diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h
index 856bf91..4d61410 100644
--- a/hw/fw_cfg.h
+++ b/hw/fw_cfg.h
@@ -30,7 +30,9 @@
#define FW_CFG_FILE_FIRST 0x20
#define FW_CFG_FILE_SLOTS 0x10
-#define FW_CFG_MAX_ENTRY (FW_CFG_FILE_FIRST+FW_CFG_FILE_SLOTS)
+#define FW_CFG_FILE_LAST_SLOT (FW_CFG_FILE_FIRST+FW_CFG_FILE_SLOTS)
+#define FW_CFG_BOOTINDEX (FW_CFG_FILE_LAST_SLOT + 1)
+#define FW_CFG_MAX_ENTRY FW_CFG_BOOTINDEX
#define FW_CFG_WRITE_CHANNEL 0x4000
#define FW_CFG_ARCH_LOCAL ...
Might not fit: with pci we can have 256 nested buses. devpath is allocated with strdup, not qemu_malloc, so I guess it should be freed with free? Alternatively, let's add qemu_strdup --
Will be harder for Seabios. I can use more than one byte for length, but Nah, not at all. -- Gleb. --
Why not? It's easy to specify this on qemu command line. You do nothing to detect this and gracefully fail either, do you? -- MST --
I think it will be easier if we don't try to do this in one pass. 1. pass: calculate total length and # of devices 2. allocate --
I started to implement this in OpenBIOS but I noticed a small issue. First the first byte must be read to determine the length. Then the read routine will be called again to read the correct amount of bytes. This would work, but since there is no shortage of IDs, I'd prefer a system where one ID is used to query the length and another ID is used to read the data, without the length byte. This is similar to how command line, initrd etc. are handled. This would have the advantage that since fw_cfg uses little endian format, the length value would easily scale to for example 64 bits to support terabytes of boot device lists. ;-) --
Yea. Let's just print # of devices as a property, in ASCII. No endian-ness, no nothing. Also - can we just NULL-terminate each ID? --
No, we should use LE numbers like other IDs. To be more specific, this is what I meant (instead of FW_CFG_BOOTINDEX): FW_CFG_BOOTINDEX_LEN: get LE integer length of the boot device data. FW_CFG_BOOTINDEX_DATA: get the boot device data as NUL terminated C strings, all strings back-to-back. The reader can determine number of strings. --
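A minimal sketch of what the reader side of this two-ID scheme could look like, assuming the firmware has already fetched the length through FW_CFG_BOOTINDEX_LEN and copied that many bytes from FW_CFG_BOOTINDEX_DATA into memory. The helper split_packed_strings() is hypothetical and only covers the in-memory step of finding the back-to-back NUL-terminated strings:

```c
#include <stddef.h>
#include <string.h>

/*
 * Count the NUL-terminated strings packed back-to-back in data[0..len).
 * If out is non-NULL, store up to max pointers to the start of each string.
 * Returns the number of strings found, or -1 if the data does not end
 * with a NUL terminator.
 */
static int split_packed_strings(const char *data, size_t len,
                                const char **out, int max)
{
    size_t pos = 0;
    int count = 0;

    if (len > 0 && data[len - 1] != '\0') {
        return -1;  /* last string is not terminated */
    }
    while (pos < len) {
        if (out && count < max) {
            out[count] = data + pos;
        }
        count++;
        pos += strlen(data + pos) + 1;  /* skip the string and its NUL */
    }
    return count;
}
```

Because the total length is known up front, the buffer can be allocated in one step and the number of strings falls out of a single scan.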
This should be
#define FW_CFG_MAX_ENTRY (FW_CFG_BOOTINDEX + 1)
because the check is like this:
if ((key & FW_CFG_ENTRY_MASK) >= FW_CFG_MAX_ENTRY) {
s->cur_entry = FW_CFG_INVALID;
With that change, I got the bootindex passed to OpenBIOS:
OpenBIOS for Sparc64
Configuration device id QEMU version 1 machine id 0
kernel cmdline
CPUs: 1 x SUNW,UltraSPARC-IIi
UUID: 00000000-0000-0000-0000-000000000000
bootindex num_strings 1
bootindex /pbm@000001fe00000000/ide@5/drive@1/disk@0
The device path does not match exactly, but it's close:
/pci@1fe,0/pci-ata@5/ide1@600/disk@0
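The off-by-one reported at the top of this message can be checked in isolation. The sketch below reuses the constants from the patch; key_is_valid() is a hypothetical helper that mirrors the quoted comparison and leaves out the FW_CFG_ENTRY_MASK step for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the patch under discussion. */
#define FW_CFG_FILE_FIRST      0x20
#define FW_CFG_FILE_SLOTS      0x10
#define FW_CFG_FILE_LAST_SLOT  (FW_CFG_FILE_FIRST + FW_CFG_FILE_SLOTS)
#define FW_CFG_BOOTINDEX       (FW_CFG_FILE_LAST_SLOT + 1)

/* Mirrors the quoted check: a key is accepted only if it is below max_entry. */
static bool key_is_valid(uint16_t key, uint16_t max_entry)
{
    return key < max_entry;
}
```

With the maximum set to FW_CFG_BOOTINDEX the new key itself is rejected; with FW_CFG_BOOTINDEX + 1 it is accepted.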
--
pbm->pci should be solvable by the patch at the end. Where in the spec
is it allowed to abbreviate 1fe00000000 as 1fe,0? The spec allows dropping
leading zeroes, but the TARGET_FMT_plx definition in targphys.h has 0 after
%. I can define another one without leading zeroes. Can you suggest
a name? TARGET_FMT_lx is poisoned. As for ATA, there is no Open Firmware
binding spec for ATA, so everyone does what he pleases. I based my
implementation on what Open Firmware shows when running on qemu x86.
"pci-ata" should be "ide" according to the PCI binding spec :)
diff --git a/hw/apb_pci.c b/hw/apb_pci.c
index c619112..643aa49 100644
--- a/hw/apb_pci.c
+++ b/hw/apb_pci.c
@@ -453,6 +453,7 @@ static PCIDeviceInfo pbm_pci_host_info = {
static SysBusDeviceInfo pbm_host_info = {
.qdev.name = "pbm",
+ .qdev.fw_name = "pci",
.qdev.size = sizeof(APBState),
.qdev.reset = pci_pbm_reset,
.init = pci_pbm_init_device,
--
Gleb.
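The leading-zero question raised above comes down to the printf width flag. A minimal sketch, assuming a 64-bit unit address; format_unit_addr() is a hypothetical helper and not the TARGET_FMT_plx macro itself:

```c
#include <inttypes.h>
#include <stdio.h>

/* Format a unit address as hex, with or without zero padding to 16 digits. */
static int format_unit_addr(char *buf, size_t size, uint64_t addr, int padded)
{
    if (padded) {
        return snprintf(buf, size, "%016" PRIx64, addr);
    }
    return snprintf(buf, size, "%" PRIx64, addr);
}
```

For the address 0x1fe00000000 the padded form reproduces the 000001fe00000000 seen in the OpenBIOS output quoted earlier in the thread, and the unpadded form gives 1fe00000000.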
--
I think OpenBIOS for Sparc64 is not correct here, so it may be a bad reference architecture. OBP on a real Ultra-5 used a path like this: /pci@1f,0/pci@1,1/ide@3/disk@0,0 pci@1f,0 specifies the PCI host bridge at UPA bus port ID of 0x1f. Yes, for example there is no ATA in the Ultra-5 tree but in UltraAX it exists: Perhaps the FW path should use device class names if no name is specified. I'll try Sparc32 to see how this fits there. --
According to the device name, the pci controller qemu creates is memory mapped at address 1fe00000000, and by looking at the code I can see that this What do you mean by "device class name"? We can do something like this:
    if (dev->child_bus.lh_first)
        return dev->child_bus.lh_first->info->name;
i.e. if there is a child bus, use its bus name as the fw name. This will make all pci devices have "pci" as fw name automatically. The problem is -- Gleb. --
I meant PCI class name, like "display" for display controllers, Except bootindex is not implemented for SCSI. --
Will look into adding it. -- Gleb. --
Thanks. The bootindex on Sparc32 looks like this:
bootindex /esp@0000000078800000/disk@1,0 /ethernet@ffffffffffffffff/ethernet-phy@0
I don't think I got Lance setup right. OF paths for the devices would be:
/iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000/sd@1,0
/iommu@0,10000000/sbus@0,10001000/ledma@5,8400010/le@5,8c00000
The logic for ESP is that ESP (registers at 0x78800000, slot offset 0x880000) is handled by the DMA controller (registers at 0x78400000, slot offset 0x840000), they are in a SBus slot #5, and SBus (registers at 0x10001000) is in turn handled by IOMMU (registers at 0x10000000). Lance should be handled the same way. This hierarchy is partly known by QEMU because DMA accesses use this flow, but not otherwise. There is no concept of SBus slots, DMA talks to IOMMU directly. Though in this case ESP, Lance and their DMA controllers are all on-board devices in a MACIO chip. It may be possible to add the hierarchy information at each stage. It should also be possible for BIOS to determine the device just from the physical address if we ignored OF compatibility. --
For arches other than x86 there is a lot of work left to be done :) For starters, exotic sparc buses should get their own get_fw_dev_path() If qdev hierarchy does not correspond to real HW there is no much we can It would be nice to be OF compatible at least at some level. Of course the OF spec is not strict enough to have two different implementations produce exactly the same device path that can be compared by strcmp. Can we apply the series now? At least for x86 it provides useful paths, and work can continue for other arches by interested parties. -- Gleb. --
That's bad. This raises a concern: if these paths expose qdev Something I only now realized is that we commit to never changing the paths for any architecture that supports migration. -- MST --
The path exposes the internal HW hierarchy. It is designed to do so. Qdev is designed to do the same: describe the HW hierarchy. If qdev fails to do so it is broken. I do not see a connection to migration at all since the path is No connection to migration whatsoever. -- Gleb. --
Yes. But since you use qdev to build up the path, a broken The connection is that if we pass the list with path 1 which you define as broken to BIOS, then migrate to a machine with an updated qemu which has a correct path, BIOS won't be able to complete the boot. Right? Same in the reverse direction. A solution could be fuzzy matching It just seems silly to use different paths for the same thing. Besides the connection above, I was hoping to use these paths for section names in migration. If we can't guarantee they are stable, we'll have to roll our own, and if we do this, with stability guarantees required for migration format, --
Qdev bug. Fix it like any other bug. The nice is that when you compare You solve it like you solve all such issues, with -M machine type. But the problem exists only if migration happens in a short window between the start of the boot process and BIOS reading the boot order string. It doesn't matter what you use for migration purposes: as long as it depends on the qdev hierarchy it will have a problem when the qdev hierarchy changes, and if it doesn't you can't produce unique names reliably. -- Gleb. --
So that's unavoidable if we think paths are correct. But if we know they are wrong, we are better off No I mean qemu could do matching fuzzily. This way if we get a path from the old BIOS we can We can, it's not like OF is the only way to enumerate. We could have driver-specific paths for example, exactly like we currently have. I.e. paths don't have to be globally unique because each driver has its own domain. It seems cleaner to use an existing spec but we must figure out how it will not become a support issue. -- MST --
They are correct for x86. My patch set does not even try to cover all HW. If sparc wants to use them too, it had better be fixed. Or if there is enough Qemu does not take paths from BIOS so I don't know what are you talking What we have currently is not even close to being correct. It happens to work since it is implemented only for one bus type and we can have only one of this bus right now. And of course it is not suitable for passing If you think you can figure out how to describe a device path (or even give a globally unique name to a device) without depending on the internal qdev implementation, go ahead and do that for migration. -- Gleb. --
Nasty as in hard to reproduce. -- MST --
It is very easy to reproduce if you know what you are looking for :). Just stick sleep() in correct place in the BIOS. -- Gleb. --
Why not just return a newline separated list that is null terminated? -Kevin --
Doing it like this will needlessly complicate the firmware side. How do you
know how much memory to allocate before reading the device list? Doing it
like Blue suggests (have BOOTINDEX_LEN and BOOTINDEX_STRING) solves this.
To create a nice array from the bootindex string your firmware will still
have to do an additional pass on it though. With a format like above the
code would look like this:
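uint8_t n, l;
char **arr;
int i;

qemu_cfg_read(&n, 1);              /* number of device strings */
arr = alloc(n * sizeof(*arr));     /* one pointer per string */
for (i = 0; i < n; i++) {
    qemu_cfg_read(&l, 1);          /* 1-byte length of this string */
    arr[i] = zalloc(l + 1);        /* +1 keeps the string NUL-terminated */
    qemu_cfg_read(arr[i], l);
}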
--
Gleb.
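For comparison, here is a self-contained sketch of the same length-prefixed decoding, operating on a buffer in memory instead of the fw_cfg port so that it can be compiled and tested on its own. The function name decode_boot_list() and the use of calloc() are assumptions for illustration; firmware would use its own allocator and qemu_cfg_read().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decode a blob laid out as: one count byte, then for each entry one
 * length byte followed by that many characters (no terminator).
 * Returns a malloc'ed array of malloc'ed C strings and stores the count
 * in *n_out, or returns NULL on truncated input or allocation failure.
 */
static char **decode_boot_list(const uint8_t *blob, size_t size, int *n_out)
{
    size_t pos = 0;
    size_t l;
    int n, i;
    char **arr;

    if (size < 1) {
        return NULL;
    }
    n = blob[pos++];
    arr = calloc(n ? n : 1, sizeof(*arr));
    if (!arr) {
        return NULL;
    }
    for (i = 0; i < n; i++) {
        if (pos >= size) {
            goto fail;               /* length byte is missing */
        }
        l = blob[pos++];
        if (l > size - pos) {
            goto fail;               /* string body is truncated */
        }
        arr[i] = calloc(l + 1, 1);   /* zeroed, so NUL-terminated */
        if (!arr[i]) {
            goto fail;
        }
        memcpy(arr[i], blob + pos, l);
        pos += l;
    }
    *n_out = n;
    return arr;

fail:
    while (i-- > 0) {
        free(arr[i]);
    }
    free(arr);
    return NULL;
}
```

The one-byte count and length fields cap the list at 255 entries of at most 255 characters each, which is the limitation debated in the replies.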
--
At this point I don't care about format. But I would like one without 1-byte-length limitations, just so we can cover whatever pci can throw at us. -- MST --
To do a memory scan you need to read it into memory first. To read it into memory you need to know how much memory to allocate, and to know how much memory to allocate you need to do a memory scan... Notice the pattern here :) Of course you can scan IO space too discarding everything you read first More code, each line of code potentially introduces a bug. But I will go with I agree. 1-byte for one device string may be too limiting. It is still more than 15 PCI bridges on a PC and if you have your pci bus go that deep you are doing something very wrong. But according to spec device name can be 32 byte long and device address may be 64bit physical address and that makes length of one device element to be 50 byte. -- Gleb. --
My preference would be for the size to be exposed via the QEMU_CFG_FILE_DIR selector. (My preference would be for all objects in fw_cfg to have entries in QEMU_CFG_FILE_DIR describing their size in a reliable manner.) -Kevin --
Will the interface suggested by Blue be good for you? The one with two fw_cfg ids: BOOTINDEX_LEN for the length and BOOTINDEX_DATA for the device list. I already changed my implementation to this one. Using FILE_DIR requires us to generate a synthetic name. Hmm, BTW I do not see proper endianness handling in FILE_DIR. -- Gleb. --
That's just me. Everything is OK there with endianness. -- Gleb. --
I dislike how different fw_cfg objects pass the length in different ways (eg, QEMU_CFG_E820_TABLE passes length as first 4 bytes). This is a common problem - I'd prefer if we could adopt one uniform way of passing length. I think QEMU_CFG_FILE_DIR solves this problem well. I also have an ulterior motive here. If the boot order is exposed as a newline separated list via an entry in QEMU_CFG_FILE_DIR, then this becomes free for coreboot users as well. (On coreboot, the boot order could be placed in a "file" in flash with no change to the seabios code.) -Kevin --
Looking at the available fw cfg options I see that _SIZE _DATA is also a common pattern. The problem with QEMU_CFG_FILE_DIR is that we have very few available slots right now. If we are going to require everything to use it we better grow the number of available slots considerably now while it is easily done (no option defined above file slots yet). I personally do not have preferences one way or the other. Blue are you You can define a get_boot_order() function and implement it differently for qemu and coreboot. For coreboot it will be a one-liner. Just call cbfs_copyfile("bootorder"). BTW why is newline separation important? -- Gleb. --
Sure, but it'd be nice to just use romfile_copy("bootorder"). Using newline separated just makes it easier for users to vi and/or cat the file. -Kevin --
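To make the newline-separated variant concrete, here is a minimal sketch of how a reader might split such a file once it is in memory as a NUL-terminated buffer. split_lines() is a hypothetical helper, not a SeaBIOS or coreboot function; how the buffer gets loaded is left out.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split a NUL-terminated, newline-separated list in place: each '\n' is
 * overwritten with '\0' and up to max line starts are stored in out.
 * Empty lines are skipped. Returns the number of entries stored.
 */
static int split_lines(char *text, char **out, int max)
{
    int count = 0;

    while (*text && count < max) {
        char *nl = strchr(text, '\n');

        if (nl) {
            *nl = '\0';
        }
        if (*text) {
            out[count++] = text;
        }
        if (!nl) {
            break;          /* last line had no trailing newline */
        }
        text = nl + 1;
    }
    return count;
}
```

The same buffer remains readable with cat or an editor before it is split, which is the convenience argued for above.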
FW_CFG_FILE_DIR seems to be a bit poorly designed. Maybe we should deprecate it and design a more scalable model. There are also string variables passed to BIOS (-prom-env for Sparc/PPC) which could then Newline and zero are both OK since neither can appear inside a valid boot path. --
