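MONITORING OSDS AND PGS
Note
A fault in one part of the cluster may prevent you from accessing a particular object, but that doesn't mean you can't access other objects. When you run into a fault, don't panic. Just follow the steps for monitoring your OSDs and placement groups. Then, begin troubleshooting.
Ceph is generally self-repairing. However, when problems persist, monitoring OSDs and placement groups will help you identify the problem.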
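MONITORING OSDS
An OSD's status is either in the cluster (in) or out of the cluster (out), and it is either up and running (up) or down and not running (down). If an OSD is up, it may be either in the cluster (you can read and write data) or out of the cluster. If it was in the cluster and has recently moved out of the cluster, Ceph will migrate its placement groups to other OSDs. If an OSD is out of the cluster, CRUSH will not assign placement groups to it. If an OSD is down, it should also be out.
Note
If an OSD is down and in, there is a problem and the cluster will not be in a healthy state.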
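If you execute a command such as ceph health, ceph -s or ceph -w, you may notice that the cluster does not always echo back HEALTH_OK. Don't panic. With respect to OSDs, you should expect that the cluster will NOT echo HEALTH_OK in the following circumstances:
- You haven't started the cluster yet (it won't respond).
- You have just started or restarted the cluster and it's not ready yet, because the placement groups are being created and the OSDs are in the process of peering.
- You have just added or removed an OSD.
- You have just modified your cluster map.
An important aspect of monitoring OSDs is to ensure that, when the cluster is up and running, all OSDs that are in the cluster are up and running too. To see if all OSDs are running, execute: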
ceph osd stat
The result should tell you the map epoch (eNNNN), the total number of OSDs (x), how many are up (y) and how many are in (z).
eNNNN: x osds: y up, z in
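As an illustration, the summary line can be checked programmatically. The following is a minimal Python sketch that assumes the output has exactly the format shown above; the function names and the sample values are made up for this example and are not part of Ceph. When more OSDs are in than up, at least one OSD that is in the cluster is not running.

```python
import re

# Matches the summary format shown above: "eNNNN: x osds: y up, z in".
OSD_STAT_RE = re.compile(r"e(\d+):\s*(\d+)\s+osds:\s*(\d+)\s+up,\s*(\d+)\s+in")


def parse_osd_stat(line):
    """Return the epoch and OSD counts from one `ceph osd stat` summary line."""
    match = OSD_STAT_RE.search(line)
    if match is None:
        raise ValueError("unrecognized osd stat line: %r" % line)
    epoch, total, up, in_ = (int(group) for group in match.groups())
    return {"epoch": epoch, "total": total, "up": up, "in": in_}


def has_down_in_osds(stat):
    """True when more OSDs are in the cluster than are up."""
    return stat["in"] > stat["up"]


sample = "e12: 2 osds: 1 up, 2 in"  # hypothetical sample values
stat = parse_osd_stat(sample)
print(stat, has_down_in_osds(stat))
```

The text passed to parse_osd_stat would normally be captured from the command itself; the sketch takes a string so that it can be tried without a running cluster.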
If the number of OSDs that are in the cluster is greater than the number of OSDs that are up, execute the following command to identify the ceph-osd daemons that aren't running:
ceph osd tree
dumped osdmap tree epoch 1
# id    weight  type name       up/down reweight
-1      2       pool openstack
-3      2               rack dell-2950-rack-A
-2      2                       host dell-2950-A1
0       1                               osd.0   up      1
1       1                               osd.1   down    1
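To pick the down daemons out of a larger tree, the listing can be scanned line by line. This is a rough Python sketch that assumes each OSD row contains a name of the form osd.N followed by up or down, as in the sample above; the function name is hypothetical.

```python
import re

# An OSD row carries a name such as "osd.1" followed by its up/down status.
OSD_ROW_RE = re.compile(r"\b(osd\.\d+)\s+(up|down)\b")


def down_osds(tree_text):
    """Return the names of OSDs reported as down in `ceph osd tree` text."""
    down = []
    for line in tree_text.splitlines():
        match = OSD_ROW_RE.search(line)
        if match and match.group(2) == "down":
            down.append(match.group(1))
    return down


sample_tree = """\
# id weight type name up/down reweight
-1 2 pool openstack
-3 2 rack dell-2950-rack-A
-2 2 host dell-2950-A1
0 1 osd.0 up 1
1 1 osd.1 down 1
"""
print(down_osds(sample_tree))
```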
Tip
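A well-designed CRUSH hierarchy can help you troubleshoot your cluster by letting you identify the physical location of a problem faster.
If an OSD is down, start it: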
sudo /etc/init.d/ceph -a start osd.1
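PG SETS
When CRUSH assigns placement groups to OSDs, it looks at the number of replicas for the pool and assigns the placement group to OSDs such that each replica of the placement group is assigned to a different OSD. For example, if the pool requires three replicas of a placement group, CRUSH may assign them to osd.1, osd.2 and osd.3 respectively. CRUSH actually seeks a pseudo-random placement that takes into account the failure domains you set in your CRUSH map, so you will rarely see placement groups assigned to nearest-neighbor OSDs in a large cluster. We refer to the set of OSDs that should contain the replicas of a particular placement group as the Acting Set. In some cases, an OSD in the Acting Set is down or otherwise unable to service requests for objects in the placement group. When these situations arise, don't panic. Common examples include:
- You added or removed an OSD. CRUSH then reassigned the placement group to other OSDs, thereby changing the composition of the Acting Set and starting the migration of data with a "backfill" process.
- An OSD was down, was restarted, and is now recovering.
- An OSD in the Acting Set is down or unable to service requests, and another OSD has temporarily assumed its duties.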
Ceph processes a client request using the Up Set, which is the set of OSDs that will actually handle the requests. In most cases, the Up Set and the Acting Set are virtually identical. When they are not, it may indicate that Ceph is migrating data, that an OSD is recovering, or that there is a problem (in such scenarios, Ceph usually echoes a HEALTH_WARN state with a "stuck stale" message).
To retrieve a list of placement groups, execute:
ceph pg dump
To see which OSDs are in the Acting Set or the Up Set for a given placement group, execute:
ceph pg map {pg-num}
The result should tell you the osdmap epoch (eNNN), the placement group number ({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set (acting[]).
osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2]
Note
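If the Up Set and the Acting Set do not match, this may be an indicator that the cluster is rebalancing itself or that there is a potential problem with the cluster.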
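The comparison between the two sets can be automated. Below is a small Python sketch that assumes the ceph pg map output follows the single-line format shown above; the helper names and the sample lines are illustrative only.

```python
import re

# Matches: "osdmap eNNN pg {pg-num} -> up [0,1,2] acting [0,1,2]".
PG_MAP_RE = re.compile(
    r"osdmap e(\d+) pg (\S+) .*-> up \[([\d,]*)\] acting \[([\d,]*)\]"
)


def _osd_list(text):
    """Turn "0,1,2" into [0, 1, 2]; an empty string gives an empty list."""
    return [int(item) for item in text.split(",") if item]


def parse_pg_map(line):
    """Extract the epoch, PG ID, Up Set and Acting Set from one output line."""
    match = PG_MAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized pg map line: %r" % line)
    return {
        "epoch": int(match.group(1)),
        "pgid": match.group(2),
        "up": _osd_list(match.group(3)),
        "acting": _osd_list(match.group(4)),
    }


def sets_match(info):
    """True when the Up Set and the Acting Set list the same OSDs in order."""
    return info["up"] == info["acting"]


info = parse_pg_map("osdmap e13 pg 0.1f -> up [0,1,2] acting [0,1,2]")
print(info, sets_match(info))
```

A mismatch is only a hint: as noted above, it may simply mean that the cluster is rebalancing.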
PEERING
Before you can write data to a placement group, it must be in an active state, and it should be in a clean state. For Ceph to determine the current state of a placement group, the primary OSD of the placement group (i.e., the first OSD in the acting set) peers with the secondary and tertiary OSDs to establish agreement on the current state of the placement group (assuming a pool with 3 replicas of the PG).
The OSDs also report their status to the monitor. See Configuring Monitor/OSD Interaction for details. To troubleshoot peering issues, see Peering Failure.
MONITORING PLACEMENT GROUP STATES
If you execute a command such as ceph health, ceph -s or ceph -w, you may notice that the cluster does not always echo back HEALTH_OK. After you check to see if the OSDs are running, you should also check placement group states. You should expect that the cluster will NOT echo HEALTH_OK in a number of placement group peering-related circumstances:
- You have just created a pool and placement groups haven’t peered yet.
- The placement groups are recovering.
- You have just added an OSD to or removed an OSD from the cluster.
- You have just modified your CRUSH map and your placement groups are migrating.
- There is inconsistent data in different replicas of a placement group.
- Ceph is scrubbing a placement group’s replicas.
If one of the foregoing circumstances causes Ceph to echo HEALTH_WARN, don't panic. In many cases, the cluster will recover on its own. In some cases, you may need to take action. An important aspect of monitoring placement groups is to ensure that, when the cluster is up and running, all placement groups are active, and preferably in the clean state. To see the status of all placement groups, execute:
ceph pg stat
The result should tell you the placement group map version (vNNNNNN), the total number of placement groups (x), and how many placement groups are in a particular state such as active+clean (y).
vNNNNNN: x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
Note
It is common for Ceph to report multiple states for placement groups.
In addition to the placement group states, Ceph will also echo back the amount of data used (aa), the amount of storage capacity remaining (bb), and the total storage capacity (cc). These numbers can be important in a few cases:
- You are reaching your near full ratio or full ratio.
- Your data isn’t getting distributed across the cluster due to an error in your CRUSH configuration.
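Both the state counts and the capacity figures can be pulled out of the summary line. The Python sketch below assumes the format shown above, assumes that multiple states appear as a comma-separated list before the semicolon, and assumes that both capacity figures are reported in GB; the function names and sample values are invented for illustration.

```python
import re

# Capacity part of the line, e.g. "90 GB / 100 GB avail" (units assumed to be GB).
AVAIL_RE = re.compile(r"(\d+) GB / (\d+) GB avail")


def parse_pg_stat(line):
    """Split a `ceph pg stat` line into version, PG total and per-state counts."""
    head, _, usage = line.partition(";")
    version, total_part, states_part = (p.strip() for p in head.split(":", 2))
    states = {}
    for item in states_part.split(","):
        count, state = item.split()
        states[state] = int(count)
    return {
        "version": version,
        "total": int(total_part.split()[0]),
        "states": states,
        "usage": usage.strip(),
    }


def not_active_clean(stat):
    """Number of placement groups that are in some state other than active+clean."""
    return stat["total"] - stat["states"].get("active+clean", 0)


def available_fraction(line):
    """Fraction of total capacity still available (bb / cc)."""
    match = AVAIL_RE.search(line)
    if match is None:
        raise ValueError("no capacity figures found in: %r" % line)
    avail, total = int(match.group(1)), int(match.group(2))
    return avail / total


sample = ("v5000: 192 pgs: 190 active+clean, 2 active+degraded; "
          "205 bytes data, 10 MB used, 90 GB / 100 GB avail")
stat = parse_pg_stat(sample)
print(stat["states"], not_active_clean(stat), available_fraction(sample))
```

Comparing the available fraction against your near full ratio and full ratio settings is one way to notice early that the cluster is filling up.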
Placement Group IDs
Placement group IDs consist of the pool number (not pool name) followed by a period (.) and the placement group ID, which is a hexadecimal number. You can view pool numbers and their names from the output of ceph osd lspools. The default pool names data, metadata and rbd correspond to pool numbers 0, 1 and 2 respectively. A fully qualified placement group ID has the following form:
{pool-num}.{pg-id}
And it typically looks like this:
0.1f
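Because the second part is hexadecimal, it is easy to misread an ID such as 0.10 as placement group ten. A short Python sketch of splitting and rebuilding an ID follows; the function names are hypothetical.

```python
def parse_pgid(pgid):
    """Split "{pool-num}.{pg-id}" into (pool number, placement group number)."""
    pool_text, _, pg_hex = pgid.partition(".")
    if not pool_text.isdigit() or not pg_hex:
        raise ValueError("not a placement group ID: %r" % pgid)
    # The part after the period is hexadecimal, so "1f" means 31.
    return int(pool_text), int(pg_hex, 16)


def format_pgid(pool_num, pg_num):
    """Build the textual ID from a pool number and a placement group number."""
    return "%d.%x" % (pool_num, pg_num)


print(parse_pgid("0.1f"))
print(format_pgid(0, 31))
```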
To retrieve a list of placement groups, execute the following:
ceph pg dump
You can also format the output in JSON format and save it to a file:
ceph pg dump -o {filename} --format=json
To query a particular placement group, execute the following:
ceph pg {poolnum}.{pg-id} query
Ceph will output the query in JSON format.
{
  "state": "active+clean",
  "up": [
    1,
    0
  ],
  "acting": [
    1,
    0
  ],
  "info": {
    "pgid": "1.e",
    "last_update": "4'1",
    "last_complete": "4'1",
    "log_tail": "0'0",
    "last_backfill": "MAX",
    "purged_snaps": "[]",
    "history": {
      "epoch_created": 1,
      "last_epoch_started": 537,
      "last_epoch_clean": 537,
      "last_epoch_split": 534,
      "same_up_since": 536,
      "same_interval_since": 536,
      "same_primary_since": 536,
      "last_scrub": "4'1",
      "last_scrub_stamp": "2013-01-25 10:12:23.828174"
    },
    "stats": {
      "version": "4'1",
      "reported": "536'782",
      "state": "active+clean",
      "last_fresh": "2013-01-25 10:12:23.828271",
      "last_change": "2013-01-25 10:12:23.828271",
      "last_active": "2013-01-25 10:12:23.828271",
      "last_clean": "2013-01-25 10:12:23.828271",
      "last_unstale": "2013-01-25 10:12:23.828271",
      "mapping_epoch": 535,
      "log_start": "0'0",
      "ondisk_log_start": "0'0",
      "created": 1,
      "last_epoch_clean": 1,
      "parent": "0.0",
      "parent_split_bits": 0,
      "last_scrub": "4'1",
      "last_scrub_stamp": "2013-01-25 10:12:23.828174",
      "log_size": 128,
      "ondisk_log_size": 128,
      "stat_sum": {
        "num_bytes": 205,
        "num_objects": 1,
        "num_object_clones": 0,
        "num_object_copies": 0,
        "num_objects_missing_on_primary": 0,
        "num_objects_degraded": 0,
        "num_objects_unfound": 0,
        "num_read": 1,
        "num_read_kb": 0,
        "num_write": 3,
        "num_write_kb": 1
      },
      "stat_cat_sum": {
      },
      "up": [
        1,
        0
      ],
      "acting": [
        1,
        0
      ]
    },
    "empty": 0,
    "dne": 0,
    "incomplete": 0
  },
  "recovery_state": [
    {
      "name": "Started\/Primary\/Active",
      "enter_time": "2013-01-23 09:35:37.594691",
      "might_have_unfound": [
      ],
      "scrub": {
        "scrub_epoch_start": "536",
        "scrub_active": 0,
        "scrub_block_writes": 0,
        "finalizing_scrub": 0,
        "scrub_waiting_on": 0,
        "scrub_waiting_on_whom": [
        ]
      }
    },
    {
      "name": "Started",
      "enter_time": "2013-01-23 09:35:31.581160"
    }
  ]
}
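Since the query result is JSON, a few fields are usually enough for a quick health summary. The following Python sketch reads only keys that appear in the sample output above; the function name and the trimmed sample document are made up for this example, and other Ceph versions may lay out the JSON differently.

```python
import json


def summarize_pg_query(text):
    """Pick a few health-related fields out of `ceph pg ... query` JSON text."""
    data = json.loads(text)
    # A placement group can report several states joined by "+".
    states = data["state"].split("+")
    return {
        "pgid": data["info"]["pgid"],
        "states": states,
        "active_clean": "active" in states and "clean" in states,
        "up_matches_acting": data["up"] == data["acting"],
        "unfound": data["info"]["stats"]["stat_sum"]["num_objects_unfound"],
    }


# Trimmed-down version of the sample output, kept only to exercise the function.
sample = """
{
  "state": "active+clean",
  "up": [1, 0],
  "acting": [1, 0],
  "info": {
    "pgid": "1.e",
    "stats": {"stat_sum": {"num_objects_unfound": 0}}
  }
}
"""
summary = summarize_pg_query(sample)
print(summary)
```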
The following subsections describe common states in greater detail.
CREATING
When you create a pool, it will create the number of placement groups you specified. Ceph will echo creating when it is creating one or more placement groups. Once they are created, the OSDs that are part of a placement group’s Acting Set will peer. Once peering is complete, the placement group status should be active+clean, which means a Ceph client can begin writing to the placement group.
PEERING
When Ceph is Peering a placement group, Ceph is bringing the OSDs that store the replicas of the placement group into agreement about the state of the objects and metadata in the placement group. When Ceph completes peering, this means that the OSDs that store the placement group agree about the current state of the placement group. However, completion of the peering process does NOT mean that each replica has the latest contents.
Authoritative History
Ceph will NOT acknowledge a write operation to a client until all OSDs of the acting set persist the write operation. This practice ensures that at least one member of the acting set will have a record of every acknowledged write operation since the last successful peering operation.
With an accurate record of each acknowledged write operation, Ceph can construct and disseminate a new authoritative history of the placement group: a complete and fully ordered set of operations that, if performed, would bring an OSD's copy of a placement group up to date.
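The acknowledgement rule can be pictured with a toy model. This Python sketch is not Ceph code; it only encodes the rule stated above, and the function name and OSD numbers are hypothetical.

```python
def can_acknowledge(acting_set, persisted_on):
    """True only when every OSD in the acting set has persisted the write."""
    return set(acting_set).issubset(persisted_on)


acting = [1, 0]
print(can_acknowledge(acting, {1}))     # only the primary has persisted the write
print(can_acknowledge(acting, {0, 1}))  # every member has persisted the write
```

Because a write is acknowledged only in the second situation, any single surviving member of the acting set holds a record of every acknowledged write.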
ACTIVE
Once Ceph completes the peering process, a placement group may become active. The active state means that the data in the placement group is generally available in the primary placement group and the replicas for read and write operations.
CLEAN
When a placement group is in the clean state, the primary OSD and the replica OSDs have successfully peered and there are no stray replicas for the placement group. Ceph has replicated all objects in the placement group the correct number of times.
DEGRADED
When a client writes an object to the primary OSD, the primary OSD is responsible for writing the replicas to the replica OSDs. After the primary OSD writes the object to storage, the placement group will remain in a degraded state until the primary OSD has received an acknowledgement from the replica OSDs that Ceph created the replica objects successfully.
The reason a placement group can be active+degraded is that an OSD may be active even though it doesn’t hold all of the objects yet. If an OSD goes down, Ceph marks each placement group assigned to the OSD as degraded. The OSDs must peer again when the OSD comes back online. However, a client can still write a new object to a degraded placement group if it is active.
If an OSD is down and the degraded condition persists, Ceph may mark the down OSD as out of the cluster and remap the data from the down OSD to another OSD. The time between being marked down and being marked out is controlled by mon osd down out interval, which is set to 300 seconds by default.
A placement group can also be degraded because Ceph cannot find one or more objects that Ceph thinks should be in the placement group. While you cannot read or write to unfound objects, you can still access all of the other objects in the degraded placement group.
RECOVERING
Ceph was designed for fault-tolerance at a scale where hardware and software problems are ongoing. When an OSD goes down, its contents may fall behind the current state of other replicas in the placement groups. When the OSD is back up, the contents of the placement groups must be updated to reflect the current state. During that time period, the OSD may reflect a recovering state.
Recovery isn’t always trivial, because a hardware failure might cause a cascading failure of multiple OSDs. For example, a network switch for a rack or cabinet may fail, which can cause the OSDs of a number of host machines to fall behind the current state of the cluster. Each one of the OSDs must recover once the fault is resolved.
Ceph provides a number of settings to balance the resource contention between new service requests and the need to recover data objects and restore the placement groups to the current state. The osd recovery delay start setting allows an OSD to restart, re-peer and even process some replay requests before starting the recovery process. The osd recovery threads setting limits the number of threads for the recovery process (1 thread by default). The osd recovery thread timeout sets a thread timeout, because multiple OSDs may fail, restart and re-peer at staggered rates. The osd recovery max active setting limits the number of recovery requests an OSD will entertain simultaneously to prevent the OSD from failing to serve requests. The osd recovery max chunk setting limits the size of the recovered data chunks to prevent network congestion.
BACK FILLING
When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs in the cluster to the newly added OSD. Forcing the new OSD to accept the reassigned placement groups immediately can put excessive load on the new OSD. Back filling the OSD with the placement groups allows this process to begin in the background. Once backfilling is complete, the new OSD will begin serving requests when it is ready.
During the backfill operations, you may see one of several states: backfill_wait indicates that a backfill operation is pending, but isn’t underway yet; backfill indicates that a backfill operation is underway; and, backfill_too_full indicates that a backfill operation was requested, but couldn’t be completed due to insufficient storage capacity.
Ceph provides a number of settings to manage the load spike associated with reassigning placement groups to an OSD (especially a new OSD). By default, osd_max_backfills sets the maximum number of concurrent backfills to or from an OSD to 10. The osd backfill full ratio enables an OSD to refuse a backfill request if the OSD is approaching its full ratio (85%, by default). If an OSD refuses a backfill request, the osd backfill retry interval enables an OSD to retry the request (after 10 seconds, by default). OSDs can also set osd backfill scan min and osd backfill scan max to manage scan intervals (64 and 512, by default).
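How these limits interact can be sketched as a toy admission check. The Python below is not how Ceph implements backfill; it only models the two defaults quoted above (10 concurrent backfills and an 85% ratio). The function and parameter names are hypothetical, and the exact comparison used at the threshold is an assumption.

```python
def accepts_backfill(used_ratio, active_backfills,
                     backfill_full_ratio=0.85, max_backfills=10):
    """Toy model: would an OSD accept one more backfill request?

    used_ratio is the fraction of the OSD's capacity in use (0.0 to 1.0).
    The defaults mirror the values quoted in the text above.
    """
    if used_ratio >= backfill_full_ratio:
        return False  # too full: refuse, and the requester retries later
    return active_backfills < max_backfills


print(accepts_backfill(0.40, 3))
print(accepts_backfill(0.90, 3))
print(accepts_backfill(0.40, 10))
```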
REMAPPED
When the Acting Set that services a placement group changes, the data migrates from the old acting set to the new acting set. It may take some time for a new primary OSD to service requests. So it may ask the old primary to continue to service requests until the placement group migration is complete. Once data migration completes, the mapping uses the primary OSD of the new acting set.
STALE
While Ceph uses heartbeats to ensure that hosts and daemons are running, the ceph-osd daemons may also get into a stuck state where they aren’t reporting statistics in a timely manner (e.g., a temporary network fault). By default, OSD daemons report their placement group, up thru, boot and failure statistics every half second (i.e., 0.5), which is more frequent than the heartbeat thresholds. If the Primary OSD of a placement group’s acting set fails to report to the monitor or if other OSDs have reported the primary OSD down, the monitors will mark the placement group stale.
When you start your cluster, it is common to see the stale state until the peering process completes. After your cluster has been running for a while, seeing placement groups in the stale state indicates that the primary OSD for those placement groups is down or not reporting placement group statistics to the monitor.
IDENTIFYING TROUBLED PGS
As previously noted, a placement group isn’t necessarily problematic just because its state isn’t active+clean. Generally, Ceph’s ability to self repair may not be working when placement groups get stuck. The stuck states include:
- Unclean: Placement groups contain objects that are not replicated the desired number of times. They should be recovering.
- Inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come back up.
- Stale: Placement groups are in an unknown state, because the OSDs that host them have not reported to the monitor cluster in a while (configured by mon osd report timeout).
To identify stuck placement groups, execute the following:
ceph pg dump_stuck [unclean|inactive|stale]
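If you check for stuck placement groups from a script, the command line can be built per state. The Python sketch below assumes the ceph command is on the PATH and that a cluster is reachable when run_dump_stuck is called; the helper names are hypothetical, and the output is returned as raw text rather than parsed because its layout is not described here.

```python
import subprocess

# The three stuck states accepted by the command shown above.
STUCK_STATES = ("unclean", "inactive", "stale")


def dump_stuck_argv(state):
    """Build the argument list for `ceph pg dump_stuck {state}`."""
    if state not in STUCK_STATES:
        raise ValueError("unknown stuck state: %r" % state)
    return ["ceph", "pg", "dump_stuck", state]


def run_dump_stuck(state):
    """Run the command and return its raw text output (needs a reachable cluster)."""
    result = subprocess.run(dump_stuck_argv(state),
                            capture_output=True, text=True, check=True)
    return result.stdout


print(dump_stuck_argv("stale"))
```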
See Placement Group Subsystem for additional details. To troubleshoot stuck placement groups, see Troubleshooting PG Errors.
FINDING AN OBJECT LOCATION
To store object data in the Ceph Object Store, a Ceph client must:
- Set an object name
- Specify a pool
The Ceph client retrieves the latest cluster map and the CRUSH algorithm calculates how to map the object to a placement group, and then calculates how to assign the placement group to an OSD dynamically. To find the object location, all you need is the object name and the pool name. For example:
ceph osd map {poolname} {object-name}
Exercise: Locate an Object
As an exercise, let's create an object. Specify an object name, a path to a test file containing some object data and a pool name using the rados put command on the command line. For example:
rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data
To verify that the Ceph Object Store stored the object, execute the following:
rados -p data ls
Now, identify the object location:
ceph osd map {pool-name} {object-name}
ceph osd map data test-object-1
Ceph should output the object’s location. For example:
osdmap e537 pool 'data' (0) object 'test-object-1' -> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]
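The location line can be broken into its parts for use in scripts. This Python sketch assumes the output matches the example line above exactly; the function name and the dictionary keys are chosen for this illustration.

```python
import re

OSD_MAP_RE = re.compile(
    r"osdmap e(\d+) pool '([^']*)' \((\d+)\) object '([^']*)'"
    r" -> pg (\S+) \((\S+)\) -> up \[([\d,]*)\] acting \[([\d,]*)\]"
)


def parse_osd_map(line):
    """Extract pool, object, placement group and OSD sets from `ceph osd map` output."""
    match = OSD_MAP_RE.search(line)
    if match is None:
        raise ValueError("unrecognized osd map line: %r" % line)
    epoch, pool, pool_num, obj, raw_pg, pgid, up, acting = match.groups()
    return {
        "epoch": int(epoch),
        "pool": pool,
        "pool_num": int(pool_num),
        "object": obj,
        "raw_pg": raw_pg,  # the value printed right after "pg"
        "pgid": pgid,      # the placement group ID printed in parentheses
        "up": [int(x) for x in up.split(",") if x],
        "acting": [int(x) for x in acting.split(",") if x],
    }


line = ("osdmap e537 pool 'data' (0) object 'test-object-1' "
        "-> pg 0.d1743484 (0.4) -> up [1,0] acting [1,0]")
location = parse_osd_map(line)
print(location["pgid"], location["up"], location["acting"])
```

Here the first OSD in the acting list, osd.1, is the primary for placement group 0.4.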
To remove the test object, simply delete it using the rados rm command. For example:
rados rm test-object-1 --pool=data
As the cluster evolves, the object location may change dynamically. One benefit of Ceph’s dynamic rebalancing is that Ceph relieves you from having to perform the migration manually. See the Architecture section for details.