Removing duplicate entries from a JSON file - Beautiful Soup

Date: 2022-12-01 00:13:54

I am running a script to scrape a website for textbook information, and the script works. However, when it writes to a JSON file it gives me duplicate results. I am trying to figure out how to remove the duplicates from the JSON file. Here is my code:


from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/', 
'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []
#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
       item = {}
       item['type'] = "Textbook"
       item['title'] = container.parent.a.text
       item['author'] = container.nextSibling.findNextSibling(text=True)
       item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
       item['source'] = "BC Campus"
       data.append(item) # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output:

{
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}

3 Answers

#1


1  

Figured it out. Here is the solution in case anyone else runs into this issue:


textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
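The `not in` check above rescans the whole list for every item and compares entire dictionaries. An equivalent order-preserving variant can key on a single hashable field instead, e.g. the link. A sketch with made-up entries (the field names follow the question's output):

```python
def dedupe(items, key="link"):
    """Keep the first occurrence of each item, keyed on one field."""
    seen = set()
    unique = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            unique.append(item)
    return unique

books = [
    {"title": "Project Management", "link": "https://example.org/?uuid=1"},
    {"title": "Project Management", "link": "https://example.org/?uuid=1"},
    {"title": "Exploring Movie Construction", "link": "https://example.org/?uuid=2"},
]
print(dedupe(books))  # keeps 2 of the 3 entries
```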

#2


0  

You do not need to remove any duplicates.

You only need to update the code.

Please keep reading. I have provided a detailed description of this problem. Also, don't forget to check this gist https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c which I wrote to debug your code.

» WHERE WAS THE PROBLEM?

I know you want this because you're getting duplicated dictionaries.

This is because you're selecting containers as h4 elements, and for each book's details, the specified pages https://open.bccampus.ca/find-open-textbooks/ and https://open.bccampus.ca/find-open-textbooks/?start=10 have 2 h4 elements.

That's why, instead of getting a list of 20 items (10 from each page) as the containers list, you're getting double that, i.e. a list of 40 items where each item is an h4 element.

You may get different values for each of these 40 items, but the problem arises when selecting parents: two adjacent h4 elements share the same parent, so they give the same text.

Let's clarify the problem with the following dummy markup.

Note: You can also visit https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c, as it has the Python code which I created to debug and solve this problem. It may give you some ideas.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>
<li> <!-- 2nd book -->
    <h4>
        <a> Text 3 </a>
    </h4>
    <h4>
        <a> Text 4 </a>
    </h4>
</li>
...
...
<li> <!-- 20th book -->
    <h4>
        <a> Text 39 </a>
    </h4>
    <h4>
        <a> Text 40 </a>
    </h4>
</li>

»» containers = page_soup.find_all("h4") will give the below list of h4 elements.

[
    <h4>
        <a> Text 1 </a>
    </h4>,
    <h4>
        <a> Text 2 </a>
    </h4>,
    <h4>
        <a> Text 3 </a>
    </h4>,
    <h4>
        <a> Text 4 </a>
    </h4>,
    ...
    ...
    ...
    <h4>
        <a> Text 39 </a>
    </h4>,
    <h4>
        <a> Text 40 </a>
    </h4>
]

»» In your code, the 1st iteration of the inner for loop will refer to the element below as the container variable.

<h4>
    <a> Text 1 </a>
</h4>

»» The 2nd iteration will refer to the element below as the container variable.

<h4>
    <a> Text 2 </a>
</h4>

»» In both of the above (1st, 2nd) iterations of the inner for loop, container.parent will give the below element.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>

»» And container.parent.a will give the below element.

<a> Text 1 </a>

»» Finally, container.parent.a.text gives the below text as our book title for the first 2 books.

Text 1

That's why we are getting duplicated dictionaries: the dynamic title & author come out the same too.

Let's get rid of this problem step by step.

» WEB PAGE DETAILS:

  1. We have links to 2 web pages.

  2. Each web page has details of 10 textbooks.

  3. Each book's details include 2 h4 elements.

  4. So in total, 2 x 10 x 2 = 40 h4 elements.

» OUR GOAL:

  1. Our goal is to get an array/list of only 20 dictionaries, not 40.

  2. So we need to iterate the containers list in steps of 2, i.e. skip 1 item in each iteration.
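The step-by-2 iteration can equivalently be written with slicing: every even-indexed h4 holds the title link, every odd-indexed h4 holds the metadata. A small illustration with placeholder strings standing in for the real h4 tags:

```python
# Placeholder stand-ins for the h4 elements scraped from the pages.
containers = ["h4-title-1", "h4-meta-1", "h4-title-2", "h4-meta-2"]

# containers[0::2] -> title h4s, containers[1::2] -> metadata h4s.
for title_h4, meta_h4 in zip(containers[0::2], containers[1::2]):
    print(title_h4, "->", meta_h4)
# h4-title-1 -> h4-meta-1
# h4-title-2 -> h4-meta-2
```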

» MODIFIED WORKING CODE:

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
  'https://open.bccampus.ca/find-open-textbooks/', 
  'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}
        item['type'] = "Textbook"
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['title'] = containers[index].parent.a.text
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True)

        data.append(item) # add the item to the list (inside the loop, once per book)

with open("./json/bc-modified-final.json", "w") as writeJSON:
  json.dump(data, writeJSON, ensure_ascii=False)

» OUTPUT:

[
    {
        "type": "Textbook",
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "authors": " Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Exploring Movie Construction and Production",
        "authors": " John Reich, SUNY Genesee Community College",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    {
        "type": "Textbook",
        "title": "Project Management",
        "authors": " Adrienne Watt",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
        "source": "BC Campus"
    },
    ...
    ...
    ...
    {
        "type": "Textbook",
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "authors": " Michelle Bonczek Evory. Western Michigan University",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus"
    }
]

Finally, I tried to modify your code and added more details (description, date & categories) to the dictionary object.

Python version: 3.6


Dependency: pip install beautifulsoup4


» MODIFIED WORKING CODE (ENHANCED VERSION):

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = [
    'https://open.bccampus.ca/find-open-textbooks/', 
    'https://open.bccampus.ca/find-open-textbooks/?start=10'
]

data = []

#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.find_all("h4")

    for index in range(0, len(containers), 2):
        item = {}

        # Store book's information as per given the web page (all 5 are dynamic)
        item['title'] = containers[index].parent.a.text
        item["catagories"] = [a_tag.text for a_tag in containers[index + 1].find_all('a')]
        item['authors'] = containers[index].nextSibling.findNextSibling(text=True).strip()
        item['date'] = containers[index].parent.find_all("strong")[1].findNextSibling(text=True).strip()
        item["description"] = containers[index].parent.p.text.strip()

        # Store extra information (1st is dynamic, last 2 are static)
        item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
        item['source'] = "BC Campus"
        item['type'] = "Textbook"

        data.append(item) # add the item to the list

with open("./json/bc-modified-final-my-own-version.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

» OUTPUT (ENHANCED VERSION):

[
    {
        "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
        "catagories": [
            "Ancillary Resources"
        ],
        "authors": "Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
        "date": "May 3, 2018",
        "description": "Description: The purpose of this textbook is to help learners develop best practices in vital sign measurement. Using a multi-media approach, it will provide opportunities to read about, observe, practice, and test vital sign measurement.",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    {
        "title": "Exploring Movie Construction and Production",
        "catagories": [
            "Adopted"
        ],
        "authors": "John Reich, SUNY Genesee Community College",
        "date": "May 2, 2018",
        "description": "Description: Exploring Movie Construction and Production contains eight chapters of the major areas of film construction and production. The discussion covers theme, genre, narrative structure, character portrayal, story, plot, directing style, cinematography, and editing. Important terminology is defined and types of analysis are discussed and demonstrated. An extended example of how a movie description reflects the setting, narrative structure, or directing style is used throughout the book to illustrate ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    },
    ...
    ...
    ...
    {
        "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
        "catagories": [],
        "authors": "Michelle Bonczek Evory. Western Michigan University",
        "date": "Apr 27, 2018",
        "description": "Description: Informed by a writing philosophy that values both spontaneity and discipline, Michelle Bonczek Evory’s Naming the Unnameable: An Approach to Poetry for New Generations  offers practical advice and strategies for developing a writing process that is centered on play and supported by an understanding of America’s rich literary traditions. With consideration to the psychology of invention, Bonczek Evory provides students with exercises aimed to make writing in its early stages a form of play that ...[more]",
        "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
        "source": "BC Campus",
        "type": "Textbook"
    }
]

That's it. Thanks.


#3


0  

We'd better use a set data structure instead of a list. It doesn't preserve insertion order, but unlike a list it doesn't store duplicates. Note, however, that dictionaries are not hashable, so each item must be converted to a hashable form (such as a tuple of its items) before it can be added to a set, and converted back before dumping to JSON.

Change your code

 data = []

To

data = set()

And

data.append(item)

To

data.add(tuple(sorted(item.items())))

Then dump with json.dump([dict(t) for t in data], writeJSON, ensure_ascii=False).
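Putting the set-based idea together, a minimal runnable sketch (with made-up records, not a drop-in change to the original script) might look like:

```python
import json

records = [
    {"type": "Textbook", "title": "Project Management"},
    {"type": "Textbook", "title": "Project Management"},  # duplicate
    {"type": "Textbook", "title": "Exploring Movie Construction"},
]

data = set()
for item in records:
    data.add(tuple(sorted(item.items())))  # hashable, order-independent form

unique = [dict(t) for t in data]  # back to plain dicts for json.dump
print(json.dumps(unique, ensure_ascii=False))
print(len(unique))  # 2
```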
