Screen scraping internal web sites for the content owner using ColdFusion

Time: 2022-03-05 09:58:52

First of all, this is a legit request. I need to obtain the owner IDs for web sites on our intranet. There are about 3000 people I need to look up, so instead of manually clicking on each site, reading the ID, and copying and pasting it into my Excel worksheet, I thought I'd just loop over the list (which I already have) and screen scrape the owner ID. I thought I'd build a CF page that would fetch the pages and store the resulting content in a database. What I'd like to do, though, is remove everything else from the returned page and retain only the Owner ID value. In the code below, the value I'm looking for is tb1245. This is the content returned:

<table>
<tr>
    <td>Site/Folder Name:</td>
    <td>AppliedScien<td>
</tr>
<tr>
    <td>Vanity URL:</td>
    <td>N/A</td>
</tr>
<tr>
    <td>Owner ID:</td>
    <td>tb1245
</tr>
<tr>
    <td>Owner Name:</td>

            <td>
                <a style="font-family: ariel">Tom W&nbsp;BEST&nbsp;(tb1245)&nbsp;</a>
                <a style="font-family: 'Wingdings'; font-size: 12pt; color: blue;" href="mailto:tb1245@us.domain.com">*</a>&nbsp;
                <a style="font-family: 'Wingdings'; font-size: 12pt; color: blue;" href="javascript:webPhone('tb1245')">(</a>
            </td>

    </tr>

    <tr>
        <td>Web/Server Admin:</td>
        <td>

                    <a style="font-family: ariel">Ohtro J&nbsp;Pepper&nbsp;(tc6139)&nbsp;</a>
                    <a style="font-family: 'Wingdings'; font-size: 12pt; color: blue;" href="mailto:ot9533@swmail.domain.com">*</a>&nbsp;
                    <a style="font-family: 'Wingdings'; font-size: 12pt; color: blue;" href="javascript:phonebook('ot9533')">(</a>

        </td>
    </tr>

Can someone help me with this? I'm supposed to have it completed by Friday, but man, this is mind-numbing work, so I'd rather do it through ColdFusion and impress my boss. :D

TIA

1 solution

#1


So, assuming you've got your list of 3000 URLs that you're looping over, for each one of those:

Use CFHTTP to get the content. It's returned in cfhttp.fileContent.
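
As a rough sketch of that step, the loop below fetches each page with CFHTTP. The URLs and the variable names (siteUrls, siteUrl, httpResult, pageHtml) are made up for illustration; the result attribute is used here only so that each response lands in its own variable rather than in the default cfhttp scope.

```cfm
<!--- Hypothetical input: a comma-separated list of intranet page URLs --->
<cfset siteUrls = "http://intranet.example.com/site1/info,http://intranet.example.com/site2/info">

<cfloop list="#siteUrls#" index="siteUrl">
    <!--- Fetch one page; the response is stored in httpResult (an assumed name) --->
    <cfhttp url="#siteUrl#" method="get" timeout="30" result="httpResult"></cfhttp>

    <!--- statusCode looks like "200 OK", so compare only the first three characters --->
    <cfif left(httpResult.statusCode, 3) eq "200">
        <cfset pageHtml = httpResult.fileContent>
        <cfoutput>#siteUrl#: fetched #len(pageHtml)# characters<br></cfoutput>
    <cfelse>
        <cfoutput>#siteUrl#: request failed (#httpResult.statusCode#)<br></cfoutput>
    </cfif>
</cfloop>
```

Checking the status code before parsing avoids running the regex against an error page when one of the 3000 requests fails.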

You then need to parse this with a regex to extract the ID. This worked for me with your content:

<cfoutput>
ID: #reReplaceNoCase(cfhttp.fileContent, ".*<tr>\s*<td>Owner ID:</td>\s*<td>([a-z0-9]+)\s*</tr>.*", "\1")#
</cfoutput>
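
One thing to watch for: when the pattern does not match, reReplaceNoCase returns the input string unchanged, so a page without an Owner ID row would hand back the whole HTML instead of an ID. A possible alternative, sketched here under the assumption that the markup looks like the sample in the question, is reFindNoCase with subexpressions, which makes the no-match case explicit. The names pageHtml, ownerPattern, match and ownerId are illustrative, and the cfsavecontent block only stands in for cfhttp.fileContent.

```cfm
<!--- Stand-in for cfhttp.fileContent, copied from the sample markup in the question --->
<cfsavecontent variable="pageHtml">
<tr>
    <td>Owner ID:</td>
    <td>tb1245
</tr>
</cfsavecontent>

<!--- Capture the run of letters and digits in the cell that follows the "Owner ID:" label --->
<cfset ownerPattern = "<td>Owner ID:</td>\s*<td>([a-z0-9]+)">
<cfset match = reFindNoCase(ownerPattern, pageHtml, 1, true)>

<!--- With no match, pos and len each hold a single 0, so there is no second element --->
<cfif arrayLen(match.pos) gte 2 and match.pos[2] gt 0>
    <cfset ownerId = mid(pageHtml, match.pos[2], match.len[2])>
<cfelse>
    <cfset ownerId = "">
</cfif>

<cfoutput>Owner ID: #ownerId#</cfoutput>
```

With the sample content this prints "Owner ID: tb1245". An empty ownerId signals that the row was missing, which can be checked before anything is written to the database.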
