How do I query the ranking of users in a database, considering only each user's latest entry?

Let's say I have a database table called "Scrape", possibly set up like:

UserID (int)   
UserName (varchar)  
Wins (int)   
Losses (int)  
ScrapeDate (datetime)

I'm trying to be able to rank my users based on their Wins/Loss ratio. However, each week I'll be scraping for new data on the users and making another entry in the Scrape table.

How can I query a list of users sorted by wins/losses, but only taking into consideration the most recent entry (ScrapeDate)?

Also, do you think it matters that people will be hitting the site while a scrape may be only partly complete?

For example I could have:

1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09  
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09  
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09

Here, the rows represent a scrape that has updated only Bob so far; Frank's new row has not been inserted yet. How would you handle this situation as well?

So, my question is:

How would you handle querying only the most recent scrape of each user to determine the rankings?

Do you think it matters that the database may be in the middle of an update (especially if a scrape could take up to 1 day to complete), so that not all users have been updated yet? If so, how would you handle this?

Thank you, and thanks for the responses you gave to my related question:

When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?

3 solutions

#1

This is what I call the "greatest-n-per-group" problem. It comes up several times per week on Stack Overflow.

I solve this type of problem using an outer join technique:

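SELECT s1.*,
  -- * 1.0 forces decimal rather than integer division;
  -- NULLIF avoids a divide-by-zero error (the ratio is NULL when losses = 0)
  s1.wins * 1.0 / NULLIF(s1.losses, 0) AS win_loss_ratio
FROM Scrape s1
LEFT OUTER JOIN Scrape s2
  ON (s1.username = s2.username AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.username IS NULL
ORDER BY win_loss_ratio DESC;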

This will return only one row for each username -- the row with the greatest value in the ScrapeDate column. That's what the outer join is for, to try to match s1 with some other row s2 with the same username and a greater date. If there is no such row, the outer join returns NULL for all columns of s2, and then we know s1 corresponds to the row with the greatest date for that given username.

This should also work when you have a partially-completed scrape in progress.
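
As a rough check of that claim, here is a sketch that loads the three sample rows from the question and runs the same outer join against them. The temp table name #ScrapeDemo is made up for this example, and the syntax assumes SQL Server.

```sql
-- #ScrapeDemo is a hypothetical temp table holding the question's sample rows.
CREATE TABLE #ScrapeDemo
(UserID int
,UserName varchar(20)
,Wins int
,Losses int
,ScrapeDate datetime
);

INSERT INTO #ScrapeDemo (UserID, UserName, Wins, Losses, ScrapeDate)
VALUES (1, 'Bob', 320, 110, '20090708');
INSERT INTO #ScrapeDemo (UserID, UserName, Wins, Losses, ScrapeDate)
VALUES (1, 'Bob', 360, 122, '20090717');
-- Frank has no row for the 7/17/09 scrape yet.
INSERT INTO #ScrapeDemo (UserID, UserName, Wins, Losses, ScrapeDate)
VALUES (2, 'Frank', 115, 20, '20090708');

-- Keep each row s1 for which no later row s2 exists for the same user.
SELECT s1.UserName, s1.Wins, s1.Losses, s1.ScrapeDate
FROM #ScrapeDemo s1
LEFT OUTER JOIN #ScrapeDemo s2
  ON (s1.UserName = s2.UserName AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.UserName IS NULL;
```

The result contains Bob's 7/17/09 row and Frank's 7/8/09 row, so a user the scrape has not reached yet still appears, with the older figures.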

This technique isn't necessarily as speedy as the CTE and RANKING solutions other answers have given. You should try both and see what works better for you. The reason I prefer my solution is that it works in any flavor of SQL.

#2

Try something like:

Select user id and max date of last entry for each user.

Select and order records to get ranking based on above query results.

This should work, although how well it performs depends on the size of your database.

DECLARE 
    @last_entries TABLE(id int, dte datetime)

-- insert date (dte) of last entry for each user (id)
INSERT INTO
    @last_entries (id, dte)
SELECT
    UserID,
    MAX(ScrapeDate)
FROM
    Scrape WITH (NOLOCK)
GROUP BY
    UserID

-- select ranking
SELECT
    -- optionally you can use RANK OVER() function to get rank value
    UserName,
    Wins,
    Losses
FROM
    @last_entries
    JOIN
        Scrape WITH (NOLOCK)
    ON
        UserID = id
        AND ScrapeDate = dte
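ORDER BY
    -- best win/loss ratio first; * 1.0 forces decimal division,
    -- NULLIF avoids a divide-by-zero error when Losses = 0
    Wins * 1.0 / NULLIF(Losses, 0) DESC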

I have not tested this code, so it may not run on the first try.
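
The same two steps can probably be folded into one statement by using the per-user maximum date as a derived table instead of a table variable. This is only a sketch against the Scrape table described in the question; the alias names (s, latest, LastScrapeDate) are invented here, and ordering by win/loss ratio is an assumption taken from the question.

```sql
SELECT s.UserName, s.Wins, s.Losses, s.ScrapeDate
FROM Scrape s
JOIN (
    -- step 1: date of the last entry for each user
    SELECT UserID, MAX(ScrapeDate) AS LastScrapeDate
    FROM Scrape
    GROUP BY UserID
) latest
  ON latest.UserID = s.UserID
 AND latest.LastScrapeDate = s.ScrapeDate
-- step 2: order the latest rows; a user with zero losses gets a NULL ratio
ORDER BY s.Wins * 1.0 / NULLIF(s.Losses, 0) DESC;
```

If two rows for one user share the same ScrapeDate, both come back, so this relies on each user having at most one entry per scrape.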

#3

The answer to part one of your question depends on the version of SQL Server you are using: SQL 2005+ offers ranking functions which make this kind of query a bit simpler than in SQL 2000 and before. I'll update this with more detail if you tell me which platform you're using.

I suspect the clearest way to handle part 2 is to display the stats for the latest complete scraping exercise, otherwise you aren't showing a time-consistent ranking (although, if your data collection exercise takes 24 hours, there's a certain amount of latitude already).

To simplify this, you could create a table to hold metadata about each scrape operation, giving each one an id, start date and completion date (at a minimum), and display those records which relate to the latest complete scrape. To make this easier, you could remove the "scrape date" from the data collection table, and replace it with a foreign key linking each data row to a row in the scrape table.

EDIT

The following code illustrates how to rank users by their latest score, regardless of whether they are time-consistent:

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeDate datetime
)

INSERT #scrape
      select 'Alice',100,200,'20090101'
union select 'Alice',120,210,'20090201'
union select 'Bob'  ,200,200,'20090101'
union select 'Clara',300,100,'20090101'
union select 'Clara',300,210,'20090201'
union select 'Dave' ,100,10 ,'20090101'


;with latestScrapeCTE
AS
(
        SELECT *
               ,ROW_NUMBER() OVER (PARTITION BY userName
                                   ORDER BY scrapeDate desc
                                  ) AS rn
               ,wins + losses AS totalPlayed
               ,wins - losses as winDiff
        from #scrape
)
SELECT userName
       ,wins
       ,losses
       ,scrapeDate
       ,winDiff
       ,totalPlayed
       ,RANK() OVER (ORDER BY winDiff desc
                              ,totalPlayed desc
                    ) as rankPos
FROM latestScrapeCTE
WHERE rn = 1
ORDER BY rankPos
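
The query above ranks by win difference, while the question asks about the win/loss ratio. A possible variant that ranks by ratio instead is sketched below; it reuses the #scrape table created just above, and the column name winLossRatio is made up for this example.

```sql
;WITH latestScrapeCTE AS
(
        SELECT userName
               ,wins
               ,losses
               ,scrapeDate
               ,ROW_NUMBER() OVER (PARTITION BY userName
                                   ORDER BY scrapeDate DESC
                                  ) AS rn
        FROM #scrape
)
SELECT userName
       ,wins
       ,losses
       -- * 1.0 forces decimal division; NULLIF gives NULL instead of an error when losses = 0
       ,wins * 1.0 / NULLIF(losses, 0) AS winLossRatio
       ,RANK() OVER (ORDER BY wins * 1.0 / NULLIF(losses, 0) DESC) AS rankPos
FROM latestScrapeCTE
WHERE rn = 1
ORDER BY rankPos
```

With the sample rows above this ranks Dave first (100/10), then Clara (300/210), Bob (200/200) and Alice (120/210). A user with no losses ends up with a NULL ratio, so you would need to decide separately where such users belong.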

EDIT 2

An illustration of the use of a metadata table to select the latest complete scrape:

create table #scrape_run
(runID int identity
,startDate datetime
,completedDate datetime
)

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeRunID int
)


INSERT #scrape_run
select '20090101', '20090102'
union select '20090201', null --null completion date indicates that the scrape is not complete

INSERT #scrape
      select 'Alice',100,200,1
union select 'Alice',120,210,2
union select 'Bob'  ,200,200,1
union select 'Clara',300,100,1
union select 'Clara',300,210,2
union select 'Dave' ,100,10 ,1


;with latestScrapeCTE
AS
(
        SELECT TOP 1 runID
                     ,startDate
        FROM #scrape_run
        WHERE completedDate IS NOT NULL
        ORDER BY completedDate DESC -- without an ORDER BY, TOP 1 returns an arbitrary completed run
)
SELECT userName
       ,wins
       ,losses
       ,startDate     AS scrapeDate
       ,wins - losses AS winDiff
       ,wins + losses AS totalPlayed
       ,RANK() OVER (ORDER BY (wins - losses)  desc
                              ,(wins + losses) desc
                    ) as rankPos
FROM #scrape
JOIN latestScrapeCTE
ON   runID = scrapeRunID
ORDER BY rankPos
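
For this to work, the scraper has to open a run before it starts and close it when it finishes. A minimal sketch of that sequence follows, reusing the #scrape_run and #scrape tables from the illustration above. The inserted values are placeholders, and a real scraper would use permanent tables, since temp tables are visible only to the session that created them.

```sql
DECLARE @runID int;

-- 1. Open a new run; completedDate stays NULL while the scrape is in progress.
INSERT INTO #scrape_run (startDate, completedDate)
VALUES (GETDATE(), NULL);

-- SCOPE_IDENTITY() returns the identity value generated by the insert above.
SET @runID = SCOPE_IDENTITY();

-- 2. As each user is scraped, store the row against this run
--    (placeholder values, not real data).
INSERT INTO #scrape (userName, wins, losses, scrapeRunID)
VALUES ('Alice', 130, 215, @runID);

-- 3. After the last user has been processed, mark the run complete.
UPDATE #scrape_run
SET completedDate = GETDATE()
WHERE runID = @runID;
```

Until step 3 runs, the ranking query keeps showing the previous complete run, so visitors never see a half-updated ranking.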
