I am using MariaDB to store HTML files. The HTML column is defined as MEDIUMTEXT COMPRESSED, alongside some other columns storing INT and VARCHAR keys. Yet I encountered a rather bizarre behaviour:

  1. When I create a record together with its HTML, it flies; however,
  2. When I create a record first and add the HTML later, it takes several minutes (!!!) to add a single HTML file.

Needless to say, the HTML column is not indexed (I don't think you can index a COMPRESSED column). I have about 100,000 HTML files in the table. Each HTML file is about 150K.

Added:

Here is the code

# Assumption: Cleaner is lxml's HTML cleaner, i.e. from lxml.html.clean import Cleaner
def loadHTML(self, conn, driver, report=False):

    cleaner = Cleaner()
    cleaner.javascript = True # This is True because we want to activate the javascript filter
    cleaner.style = True      # This is True because we want to activate the styles & stylesheet filter

    preport(f"in selLoadHTML for {type(self)}")
    cursor = conn.cursor()
    try:
        # Load the page in the browser driver and grab the rendered source
        self.openMyURL(driver, report=report)
        preport(f"opened {self.url}")
        the_html = driver.page_source
        preport(f"loaded {len(the_html)} bytes")
        # Strip scripts and styles before storing
        self.html = cleaner.clean_html(the_html)
        preport(f"cleaned {len(self.html)} bytes")
        cursor.execute(""" update Page set html=%s, htmlLoaded=NOW(), htmlError=NULL""", (self.html,))
        preport(f"updated the database")
    except Exception as e:
        print(f"Error opening {self.url}, exception {e}")
        cursor.execute(""" update Page set htmlError=NOW()""")
    conn.commit()
    cursor.close()
    return

preport is a wrapper around print that reports the time elapsed since the previous report.
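For reference, a minimal sketch of what such a wrapper could look like. This is a reconstruction, not the exact code: the module-level variable `_last_report` and the returned value are assumptions added for illustration, and the print call is chosen only so that the line layout resembles the output shown in this post.

```python
import time
from datetime import timedelta

# Monotonic timestamp of the previous report (hypothetical module-level state)
_last_report = time.monotonic()

def preport(message):
    """Print message prefixed by the time elapsed since the previous call."""
    global _last_report
    now = time.monotonic()
    elapsed = timedelta(seconds=now - _last_report)
    _last_report = now
    # print() separates its arguments with single spaces
    print("After ", elapsed, ", ", message)
    return elapsed  # returned only to make the sketch easy to check
```

Each call therefore times the step that finished just before it, so the long gap is attributed to whatever ran between two consecutive preport calls.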

Here is the output:

After  0:00:08.334192 ,  opened <url here>
After  0:00:00.049598 ,  loaded 918988 bytes
After  0:00:00.032835 ,  cleaned 43692 bytes
After  0:03:06.277489 ,  updated the database

As you can see, it took over 3 minutes to save 43K of HTML into a column that was previously NULL.

Here is a schema (simplified)

CREATE TABLE Page (
    url VARCHAR(100) UNIQUE KEY,
    html MEDIUMTEXT COMPRESSED,
    htmlError DATETIME,
    refId INT UNSIGNED,
    FOREIGN KEY (refId) REFERENCES Reference (refId)
);

In other cases, where I was just populating the database, I used:

cur.execute("""INSERT INTO Reference (refId, refText) VALUES (%s, %s) ON DUPLICATE KEY UPDATE refText = %s""", (self.ref_id, self.ref_text, self.ref_text))
cur.execute("""INSERT IGNORE INTO Page (url, html, refId) VALUES (%s, %s, %s) """, (self.url, self.html, self.ref_id))

I don't have performance data, but it flew: no more than a few seconds per record.
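One check that could help compare the two paths is to look at how many rows each statement touches. The sketch below is only an illustration, not my actual setup: it uses Python's built-in sqlite3 as a stand-in for MariaDB (so placeholders are ? rather than %s, and INSERT OR IGNORE stands in for INSERT IGNORE), a cut-down Page table with 1,000 empty rows, and a hypothetical helper rows_touched. It reads cursor.rowcount after an INSERT, after an UPDATE with no WHERE clause like the one in loadHTML, and after an UPDATE restricted by url.

```python
import sqlite3

def rows_touched(conn, sql, params=()):
    """Execute one statement and return how many rows it modified."""
    cur = conn.cursor()
    cur.execute(sql, params)
    count = cur.rowcount
    cur.close()
    return count

# In-memory stand-in for the Page table (assumption: simplified columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Page (url TEXT UNIQUE, html TEXT, htmlError TEXT)")
conn.executemany("INSERT INTO Page (url) VALUES (?)",
                 [(f"u{i}",) for i in range(1000)])

# Path 1: create the record together with its HTML
insert_count = rows_touched(
    conn, "INSERT OR IGNORE INTO Page (url, html) VALUES (?, ?)",
    ("new", "<p>x</p>"))

# Path 2: UPDATE without a WHERE clause, as in loadHTML
update_all_count = rows_touched(
    conn, "UPDATE Page SET html = ?", ("<p>x</p>",))

# Path 2 variant: UPDATE restricted to a single url
update_one_count = rows_touched(
    conn, "UPDATE Page SET html = ? WHERE url = ?", ("<p>y</p>", "u1"))

conn.commit()
print(insert_count, update_all_count, update_one_count)
```

If the real table behaves the same way, the UPDATE in loadHTML would be rewriting the html of every row rather than one, which would be consistent with the minutes-long timing; that still needs to be confirmed against MariaDB itself.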

Tags: python. Title: mariadb MEDIUMTEXT COMPRESSED updating takes forever (Stack Overflow)