How to Make an Ebook

Split the PDF file - https://www.ilovepdf.com/
Upload it to Google Drive and open the PDF in Word ⇒ OCR converts the images/PDF to text
Edit the Word file
Convert the docx file to epub with Calibre
Edit the html/css files
Package (convert to epub with Calibre)

Write a Python script that merges the chapters into a single html file, so editing is easier
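
A rough sketch of such a merge script, using only the standard library. It assumes the chapter files follow Calibre's index_split_*.html naming; the function name merge_chapters and the output name merged.html are made up for illustration.

```python
import re
from pathlib import Path

# Captures everything between <body ...> and </body>
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)


def merge_chapters(html_dir, out_name="merged.html"):
    """Concatenate the <body> content of every chapter file into one html file."""
    html_dir = Path(html_dir)
    parts = []
    # Zero-padded file names sort correctly as plain strings
    for path in sorted(html_dir.glob("index_split_*.html")):
        match = BODY_RE.search(path.read_text(encoding="utf-8"))
        if match:
            # Keep a comment marker so each chunk can be traced back to its file
            parts.append(f"<!-- {path.name} -->\n{match.group(1).strip()}")
    merged = (
        '<html><head><meta charset="utf-8"/></head><body>\n'
        + "\n".join(parts)
        + "\n</body></html>\n"
    )
    out_path = html_dir / out_name
    out_path.write_text(merged, encoding="utf-8")
    return out_path
```

The comment markers make it possible to split the merged file back into chapters by hand after editing.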

  
I now want to edit the book "30 tuổi - Mọi thứ chỉ mới bắt đầu".

From now on, all my operations will happen inside that folder.

Please help me with the following tasks:

1. In every html file there, update the title to: "30 tuổi - mọi thứ chỉ mới bắt đầu"
2. Remove every id and class from the tags: <h1>, <h2>, <h3>, <p>, <span>.
3. If several <br> tags sit next to each other, remove the extras and keep only one.
4. If a <p> tag has no content, or its content is an empty string, remove it as well.
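
Tasks 3 and 4 can be approximated with two regular expressions. This is only a sketch on raw html strings, not a full parser; the function names are invented here, and odd markup (comments, attributes containing ">") would need a real parser.

```python
import re

# Two or more <br> tags separated only by whitespace
BR_RUN = re.compile(r"<br\s*/?>(?:\s*<br\s*/?>)+", re.IGNORECASE)
# A <p> element whose body is empty or whitespace-only
EMPTY_P = re.compile(r"<p\b[^>]*>\s*</p>", re.IGNORECASE)


def collapse_br_runs(html):
    """Replace each run of adjacent <br> tags with a single <br/>."""
    return BR_RUN.sub("<br/>", html)


def drop_empty_paragraphs(html):
    """Delete <p> elements that contain nothing but whitespace."""
    return EMPTY_P.sub("", html)
```

For example, collapse_br_runs("a<br><br/>b") returns "a<br/>b", while a lone <br> is left untouched.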

-----

- For h1 tags, add class="c" and an id of the form "toc_id_{number}" so I can link to them from the table of contents file.
- For h2 tags, add class="c1".
- Add a simple stylesheet for h3 tags: larger font, centered, etc.
- The first p tag below each heading gets class "pcalibre pcalibre1".
- All remaining p tags get class="txt".

---
First, renumber the ids in the form "toc_id_{number}" for the h1 and h2 tags across all files of the book "30 tuổi - mọi thứ chỉ mới bắt đầu".

Then, based on the style of index_split_003.html from the book Yours Truly, please create a table of contents page for this book and write it to index_split_0000.html.

Read each html file and link to its h1 and h2 tags to get the ids.

---

Based on the toc.ncx file of Yours Truly, rewrite toc.ncx for the book 30 Tuổi.

Take the links and content from the folder.
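
One way the navMap part of toc.ncx could be built so that h2 entries nest under the preceding h1. This is a sketch with xml.etree from the standard library; the (filename, level, text, id) tuple shape and the name build_nav_map are assumptions for illustration, not something taken from the Yours Truly file.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"
# Serialize with a default namespace instead of an ns0: prefix
ET.register_namespace("", NCX_NS)


def build_nav_map(entries):
    """entries: list of (filename, level, text, anchor_id), level is 'h1' or 'h2'."""
    nav_map = ET.Element(f"{{{NCX_NS}}}navMap")
    current_h1 = None
    for order, (filename, level, text, anchor_id) in enumerate(entries, start=1):
        # h2 entries go under the most recent h1; everything else is top level
        parent = current_h1 if level == "h2" and current_h1 is not None else nav_map
        point = ET.SubElement(
            parent,
            f"{{{NCX_NS}}}navPoint",
            {"id": f"toc_{order}", "playOrder": str(order)},
        )
        label = ET.SubElement(point, f"{{{NCX_NS}}}navLabel")
        # ElementTree escapes &, < and > in the text when serializing
        ET.SubElement(label, f"{{{NCX_NS}}}text").text = text
        ET.SubElement(point, f"{{{NCX_NS}}}content", {"src": f"{filename}#{anchor_id}"})
        if level == "h1":
            current_h1 = point
    return nav_map
```

ET.tostring(build_nav_map(entries), encoding="unicode") gives the markup to paste between the docTitle element and the closing ncx tag.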

Editing the Word file

Notes

The goal of this step is to OCR the source and get text; the styling does not matter much here.

Font size for the whole document
- Watch out for the default font
- Times New Roman, size 12
- Format > Paragraph styles > Normal Text > Update ‘Normal text’ to match, then go to Options > Save as my default styles.
Line spacing
- Format > Line & paragraph spacing > choose 1.5
- Paragraph spacing before/after: 0 pt (or 6 pt if you like a gap between paragraphs)
Justify the body text
- Select the text
- Format > Align & Indent > Justified.
Page numbers
- Insert > Page numbers.
Create headings and a table of contents
Set the ruler consistently for the whole file

Cmd + Enter breaks the content onto a new page

Editing the epub file

After converting and spell-checking, the doc file will still contain plenty of small errors ⇒ fix them gradually
Open it in Calibre and convert from docx to epub
Save this epub file to disk, rename it to .zip, then unzip it to get a folder of html, css, … code
Open the folder in Cursor, then edit the html files
Follow the styling of the ebooks I have made before; roughly, there should be styles for:
- h1 heading
- h2 heading
- images
- quotes
- A few special pages
After editing, update content.opf so that it maps to your files
Once the steps above are done, zip the folder and rename it to .epub, or use the script below
If the epub file is broken, open it in Calibre and use the convert feature to convert it to an epub file again

import zipfile
import os

def create_epub(epub_name, source_dir):
    # Compress entries by default; the mimetype entry below overrides this
    with zipfile.ZipFile(epub_name, 'w', zipfile.ZIP_DEFLATED) as epub:
        # The mimetype file must be the first entry and stored uncompressed
        epub.write(os.path.join(source_dir, 'mimetype'), 'mimetype', compress_type=zipfile.ZIP_STORED)

        # Walk the folder and add the rest
        for foldername, subfolders, filenames in os.walk(source_dir):
            for filename in filenames:
                if filename == 'mimetype':
                    continue
                full_path = os.path.join(foldername, filename)
                # Store paths relative to the book folder, not the full disk path
                rel_path = os.path.relpath(full_path, source_dir)
                epub.write(full_path, rel_path)

create_epub("output.epub", "BCTC")
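
The opposite direction (the "rename to .zip and unzip" step) and the content.opf check can also be scripted. A sketch, assuming the manifest hrefs in content.opf are relative to the folder that holds content.opf; the function names are invented here.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import unquote

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def unpack_epub(epub_path, out_dir):
    """An epub is a zip archive, so it can be extracted without renaming it."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(out_dir)


def missing_manifest_files(opf_path):
    """Return manifest hrefs from content.opf that do not exist on disk."""
    opf_path = Path(opf_path)
    root = ET.parse(opf_path).getroot()
    missing = []
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        # hrefs may be percent-encoded (e.g. spaces as %20)
        href = unquote(item.get("href", ""))
        if not (opf_path.parent / href).is_file():
            missing.append(href)
    return missing
```

An empty list from missing_manifest_files means every file named in the manifest is present; it does not catch files on disk that the manifest forgot to list.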

Python Scripts

python split_pdf.py
 
brew install tesseract
brew install tesseract-lang
tesseract --list-langs
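
Once tesseract is installed, it can be driven from Python through its command line. A sketch, assuming the pages are already exported as images and that the Vietnamese language pack shows up as "vie" in tesseract --list-langs; the file names are placeholders.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="vie"):
    """Build the CLI call: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="vie"):
    """Run tesseract on one page image; the text lands in <output_base>.txt."""
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
    return f"{output_base}.txt"
```

Calling ocr_image("page_001.png", "page_001") would write page_001.txt next to the script, provided the tesseract binary is on PATH.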

References

https://drive.google.com/drive/u/0/folders/1tz9zxxkOsMND1TgdhPkctwnF_3kfxZAa - Cách biên tập ebook (How to edit ebooks) - Lâm Taxy

import os
import uuid
from html import escape

from bs4 import BeautifulSoup

# Directory containing the HTML files (the folder this script lives in)
HTML_DIR = os.path.dirname(os.path.abspath(__file__))

# New title
NEW_TITLE = "30 tuổi - mọi thứ chỉ mới bắt đầu"

# Tags whose id/class/style attributes get stripped
TAGS_TO_CLEAN = ["h1", "h2", "h3", "p", "span"]

# Inline style for h3 tags
H3_STYLE = "text-align:center;font-size:1.5em;font-weight:bold;padding:0.5em 0;"

# Class for the first p tag after a heading
FIRST_P_CLASS = 'pcalibre pcalibre1'


def clean_html_file(filepath, h1_counter_start=1):
    with open(filepath, 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # 1. Update title
    if soup.title:
        soup.title.string = NEW_TITLE

    # 2. Strip id, class and style from h1, h2, h3, p, span
    for tag in soup.find_all(TAGS_TO_CLEAN):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k not in ['id', 'class', 'style']}

    # 3. Remove all <br> tags
    for br in soup.find_all('br'):
        br.decompose()

    # 4. Remove p tags that are empty or whitespace-only (keep ones that wrap an image)
    for p in soup.find_all('p'):
        if not p.get_text(strip=True) and p.find('img') is None:
            p.decompose()

    # 5. Add class/id to headings and classes to p tags
    h1_counter = h1_counter_start
    for tag in soup.find_all(['h1', 'h2', 'h3']):
        if tag.name == 'h1':
            tag['class'] = 'c'
            tag['id'] = f'toc_id_{h1_counter}'
            h1_counter += 1
        elif tag.name == 'h2':
            tag['class'] = 'c1'
        elif tag.name == 'h3':
            # h3 only gets an inline style, no special class/id
            tag['style'] = H3_STYLE

        # Find the first p tag right after the heading (True matches any tag, skipping text nodes)
        next_tag = tag.find_next_sibling(True)
        if next_tag is not None and next_tag.name == 'p':
            next_tag['class'] = FIRST_P_CLASS

    # Every remaining p tag gets class txt
    for p in soup.find_all('p'):
        if p.get('class') != FIRST_P_CLASS:
            p['class'] = 'txt'

    # Write the file back
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(str(soup))
    return h1_counter


def process_and_collect_headings():
    h_counter = 1
    toc_entries = []  # (filename, tag_name, text, id)
    files = sorted(
        f for f in os.listdir(HTML_DIR)
        if f.startswith('index_split_') and f.endswith('.html') and f != 'index_split_0000.html'
    )
    for filename in files:
        filepath = os.path.join(HTML_DIR, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
        changed = False
        for tag in soup.find_all(['h1', 'h2']):
            tag_id = f'toc_id_{h_counter}'
            tag['id'] = tag_id
            toc_entries.append((filename, tag.name, tag.get_text(strip=True), tag_id))
            h_counter += 1
            changed = True
        if changed:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(str(soup))
    return toc_entries


def create_toc_html(toc_entries):
    toc_html = [
        "<?xml version='1.0' encoding='utf-8'?>",
        '<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">',
        '<head>',
        f'<title>{escape(NEW_TITLE)}</title>',
        '<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>',
        '<link href="stylesheet.css" rel="stylesheet" type="text/css"/>',
        '<link href="page_styles.css" rel="stylesheet" type="text/css"/>',
        '</head>',
        '<body class="calibre">',
        '<div class="calibre1">',
        '<h1 class="c" id="toc-title" style="text-align: center;">Table of Contents</h1>',
        '<ul class="toc-list">'
    ]
    for filename, tag, text, tag_id in toc_entries:
        # Escape heading text so characters like & or < keep the XHTML well-formed
        link = f'<a href="{filename}#{tag_id}">{escape(text)}</a>'
        if tag == 'h1':
            toc_html.append(f'<li class="toc-level">{link}</li>')
        elif tag == 'h2':
            toc_html.append(f'<li class="toc-level" style="margin-left:2em">{link}</li>')
    toc_html += [
        '</ul>',
        '</div>',
        '</body></html>'
    ]
    return '\n'.join(toc_html)


def create_toc_ncx(toc_entries):
    # Generate a new uuid for the book.
    # Note: dtb:uid should match the unique identifier declared in content.opf.
    book_uuid = str(uuid.uuid4())
    ncx = [
        "<?xml version='1.0' encoding='utf-8'?>",
        '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="eng">',
        ' <head>',
        f' <meta name="dtb:uid" content="{book_uuid}"/>',
        ' <meta name="dtb:depth" content="2"/>',
        ' <meta name="dtb:generator" content="custom-script"/>',
        ' <meta name="dtb:totalPageCount" content="0"/>',
        ' <meta name="dtb:maxPageNumber" content="0"/>',
        ' </head>',
        ' <docTitle>',
        f' <text>{escape(NEW_TITLE)}</text>',
        ' </docTitle>',
        ' <navMap>'
    ]
    # One flat navPoint per heading; id and playOrder both follow the entry order
    for order, (filename, tag, text, tag_id) in enumerate(toc_entries, start=1):
        ncx.append(
            f' <navPoint id="toc_{order}" playOrder="{order}">'
            f'<navLabel><text>{escape(text)}</text></navLabel>'
            f'<content src="{filename}#{tag_id}"/></navPoint>'
        )
    ncx += [
        ' </navMap>',
        '</ncx>'
    ]
    return '\n'.join(ncx)


def main():
    toc_entries = process_and_collect_headings()

    # Write the table of contents page
    toc_path = os.path.join(HTML_DIR, 'index_split_0000.html')
    with open(toc_path, 'w', encoding='utf-8') as f:
        f.write(create_toc_html(toc_entries))

    # Write toc.ncx
    ncx_path = os.path.join(HTML_DIR, 'toc.ncx')
    with open(ncx_path, 'w', encoding='utf-8') as f:
        f.write(create_toc_ncx(toc_entries))

    print("Updated heading ids, created the table of contents page and toc.ncx!")


if __name__ == "__main__":
    main()

🪴ttuan's garden
